Open petersilva opened 1 year ago
complexities:
With winnowing, the process that receives the directory rename may not be the same one that received the file message. If the cache is per instance, there is no guarantee that the same instance will see both the rename message and the corresponding download.
Renames are done locally (on the download side), but we need to understand how to modify the upstream URL to its pre-rename value, which is never used in normal renames. We need to figure out the retrieval URL that would correspond to undoing a rename.
Implementation is all on the downstream side (subscribers), with no change to the posting side.
concerns/challenges:
This is entirely on the poster side. The poster just creates a link so that the original download path still works for a while after a rename.
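A minimal sketch of the link idea, assuming the poster can wrap its own rename call and that the source filesystem supports symbolic links. The helper name `rename_with_compat_link` and the `dir2` target name are hypothetical, not part of sarracenia or sarrac. Removing the link after a grace period is left out.

```python
import os
import tempfile


def rename_with_compat_link(old_dir: str, new_dir: str) -> None:
    """Rename old_dir to new_dir, then leave a symlink at the old name.

    Hypothetical helper: a late subscriber that still asks for old_dir/<file>
    is served through the link until something removes it.
    """
    os.rename(old_dir, new_dir)
    # The old name no longer exists after the rename, so it can be reused
    # as a link that points at the new location.
    os.symlink(os.path.abspath(new_dir), old_dir, target_is_directory=True)


# Usage: a file posted as dir1/hoho stays readable after dir1 becomes dir2.
root = tempfile.mkdtemp()
old_dir = os.path.join(root, "dir1")
new_dir = os.path.join(root, "dir2")  # assumed new name, for illustration only
os.mkdir(old_dir)
with open(os.path.join(old_dir, "hoho"), "w") as f:
    f.write("data")

rename_with_compat_link(old_dir, new_dir)

with open(os.path.join(old_dir, "hoho")) as f:
    content = f.read()  # read through the pre-rename path
```

The trade-off is that the source tree temporarily shows both names, so anything that walks or watches the tree may see the directory twice.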
concerns/challenges:
In the C package, there is a shim_copy_post.sh script, which has a sleep before a directory is renamed to prevent this problem. Commenting out the sleep provokes the problem, so people can try things out easily.
When there are many copies in flight, and the order of operations is unclear, there will be times when the following sequence of events occurs at the source:
The first line will result in a notification message for the dir1/hoho file. The second line will result in a directory rename event.
If the consumer receives the notification after the second line has occurred, it will try to retrieve the file dir1/hoho from the source. The source file will no longer be there, because the directory has been renamed.
It would be good if the retry logic kept a record of recent directory renames, could identify whether one of them was relevant to a failed fetch, and could then adjust the fetch URL to reflect the new source location.
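One way such a record could look, as a rough sketch only. The `RenameLog` class, its parameters, and the `dir2` name are hypothetical and not existing sarracenia API. It assumes that rename events reach the same process that retries the download, which winnowing or multiple instances may break, and it does not compare rename times against the time the file notification was posted.

```python
import time
from collections import deque
from typing import Optional


class RenameLog:
    """Hypothetical bounded record of recent directory renames."""

    def __init__(self, max_entries: int = 1000, max_age: float = 300.0):
        # Each entry is (timestamp, old_dir, new_dir); oldest entries fall off.
        self._entries = deque(maxlen=max_entries)
        self.max_age = max_age  # seconds a rename stays relevant

    def record(self, old_dir: str, new_dir: str, now: Optional[float] = None) -> None:
        ts = time.time() if now is None else now
        self._entries.append((ts, old_dir.rstrip("/"), new_dir.rstrip("/")))

    def adjust(self, path: str, now: Optional[float] = None) -> str:
        """Rewrite path through every still-fresh rename, oldest first."""
        current = time.time() if now is None else now
        for ts, old, new in self._entries:
            if current - ts > self.max_age:
                continue  # too old to be worth applying
            # Match on whole path components so dir1 does not match dir10.
            if path == old or path.startswith(old + "/"):
                path = new + path[len(old):]
        return path


# Usage: on a failed fetch of dir1/hoho, retry with the adjusted path.
log = RenameLog()
log.record("dir1", "dir2", now=100.0)  # dir2 is an assumed new name
retry_path = log.adjust("dir1/hoho", now=101.0)
```

Applying renames oldest first lets a chain of renames (dir1 to dir2, then dir2 to dir3) resolve to the final location in one pass.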
Existing Test Case
We have an easy canned way to reproduce this issue. In the sarrac ( https://github.com/MetPX/sarrac ) repository, there is a shim_copy_post.sh script, which has a sleep before a directory is renamed to prevent this problem. Commenting out the sleep provokes the problem, so people can try things out easily. Then run make test_shim_copy, and the test will fail with some helpful error messages.