MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

HPC mirroring: remember moved directories when retrying failed transfers. #613

Open petersilva opened 1 year ago

petersilva commented 1 year ago

When there are many copies in flight and the order of operations is unclear, there will be times when the following sequence of events occurs at the source:


echo hoho_content >dir1/hoho
mv dir1 therealdir

The first line results in a notification message for the dir1/hoho file; the second line results in a directory rename event.

If the consumer receives the notification after the second line has occurred, it will try to retrieve the file dir1/hoho from the source. The source file will no longer be there, because the directory has been renamed.

It would be good if the retry logic kept a record of recent directory renames, could identify whether one of those renames was relevant to a failed fetch, and could then adjust the fetch URL to reflect the new source location.
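A minimal sketch of what such a rename record might look like, under stated assumptions: RenameHistory, its max_age parameter, and its methods are hypothetical names for illustration and are not part of the sarracenia API. It assumes rename events arrive as (old directory, new directory) pairs and that retries know the path they originally tried to fetch. It does not handle the case where a new dir1 is created after the rename.

```python
import time
from collections import deque


class RenameHistory:
    """Hypothetical record of recent directory renames (not a sarracenia API)."""

    def __init__(self, max_age=300.0):
        self.max_age = max_age  # seconds a rename is remembered (assumed value)
        self.renames = deque()  # (timestamp, old_dir, new_dir), oldest first

    def record(self, old_dir, new_dir, now=None):
        """Remember that old_dir was renamed to new_dir."""
        now = time.time() if now is None else now
        self.renames.append((now, old_dir.rstrip("/"), new_dir.rstrip("/")))

    def expire(self, now=None):
        """Forget renames older than max_age."""
        now = time.time() if now is None else now
        while self.renames and now - self.renames[0][0] > self.max_age:
            self.renames.popleft()

    def adjust(self, path, now=None):
        """Return path rewritten through every remembered rename that applies."""
        self.expire(now)
        # Apply renames in the order they happened, so chained renames
        # (dir1 -> dir2 -> dir3) are followed to the final location.
        for _, old_dir, new_dir in self.renames:
            if path == old_dir or path.startswith(old_dir + "/"):
                path = new_dir + path[len(old_dir):]
        return path


history = RenameHistory()
history.record("dir1", "therealdir")
adjusted = history.adjust("dir1/hoho")
print(adjusted)
```

On a failed fetch, the retry logic could call adjust() on the relative path and retry with the result if it differs from the original. Matching on a whole path component (old_dir + "/") avoids rewriting unrelated directories such as dir10.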

Existing Test Case

There is an easy canned way to reproduce this issue. In the sarrac ( https://github.com/MetPX/sarrac ) repository, there is a shim_copy_post.sh script, which sleeps before a directory is renamed in order to prevent this problem. Commenting out the sleep provokes the problem, so people can try things out easily: run make test_shim_copy, and the test will fail with some helpful error messages.

petersilva commented 1 year ago

complexities:

petersilva commented 1 year ago

Option 1: subscribers "undo" the directory rename on download/retry

The implementation is entirely on the downstream (subscriber) side, with no change to the posting side.

concerns/challenges:

Option 2: posters add a trailing symbolic link at the source

This is entirely on the poster side. The poster just creates a link so that the original download path still works for a while after a rename.
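A rough sketch of the idea, assuming a POSIX filesystem where symbolic links are available. The function name rename_with_trailing_link is hypothetical and not something that exists in sarracenia or sarrac; cleaning up the link later is left out, and there is still a brief window between the rename and the link creation where the old path does not exist.

```python
import os
import tempfile


def rename_with_trailing_link(old_dir, new_dir):
    """Rename old_dir to new_dir, then leave a symlink at the old name.

    Hypothetical helper for illustration only.
    """
    os.rename(old_dir, new_dir)
    # Use a relative target so the link does not depend on the absolute
    # location of the parent directory.
    parent = os.path.dirname(os.path.abspath(old_dir))
    target = os.path.relpath(new_dir, parent)
    os.symlink(target, old_dir)


# Reproduce the sequence from the issue in a scratch directory.
base = tempfile.mkdtemp()
os.mkdir(os.path.join(base, "dir1"))
with open(os.path.join(base, "dir1", "hoho"), "w") as f:
    f.write("hoho_content\n")

rename_with_trailing_link(os.path.join(base, "dir1"),
                          os.path.join(base, "therealdir"))

# A late fetch of the original path still finds the file through the link.
with open(os.path.join(base, "dir1", "hoho")) as f:
    content = f.read()
print(content)
```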

concerns/challenges:

Existing Test Case

In the C package, there is a shim_copy_post.sh script that sleeps before a directory is renamed in order to prevent this problem. Commenting out the sleep provokes the problem, so people can try things out easily.