Sage-Bionetworks / synapsePythonClient

Programmatic interface to Synapse services for Python
https://www.synapse.org
Apache License 2.0
65 stars 67 forks source link

Large Dataset Download Frozen/Stuck with held Lock (Download Never Finishes) with `synapseutils.syncFromSynapse` #1110

Open tommypkeane-gehc opened 2 weeks ago

tommypkeane-gehc commented 2 weeks ago

Related to #1088 , I've been repeatedly running into issues where large downloads of lots of directories in a Dataset never complete.

The download progresses partway while using synapseutils.syncFromSynapse(...), and then freezes. When pressing CTRL+C in bash, it does nothing, then the second time it escapes a held lock, and then I have to press it again 1-2 more times, and it finally breaks all the nested locks and cancels the execution of Python.

This has happened over the past several weeks on multiple different datasets and is seemingly inconsistent with time of day or day of the week.

I'm not clear on any throttling or rate-limiting configurations on Synapse, so I don't know if it's related to some backend/server-side configuration that I can't see.

One thing I've consistently started to notice is that the default ~/.synapseCache/ directory will fill with folders up to 999 apparently per this non-exposed value for fanout:

https://github.com/Sage-Bionetworks/synapsePythonClient/blob/develop/synapseclient/core/cache.py#L94-L99

I'm going to attempt to post-hoc modify the self.cache.fanout value of the synapseclient.Synapse instance, to try and raise the value to 10_000 and see if any downloads can get further. When I clear the cache with rm -rf ~/.synapseCache/*, it does seem like I can start downloading again, but I also still don't know if some rate-limiting has timed-out/reset because this is usually discovered the next day after freezing at some point overnight or earlier in the day.

I haven't looked through all of the current code to confirm this suspicion, but based-on the CTRL+C tracebacks, it seems like the sync/download session is trying to obtain a new lock or has obtained a lock and is stuck holding it, and doesn't release the lock.

Again, take this as just a guess, but I was wondering if there's a bug in the cache map logic whereby once the fanout maximum is reached, the modulo operation reuses one of the previously used cache directories, but if that directory is still also being used by a held lock, then a second lock is obtained on-it/related-to-it, you're going to end-up with two locks, one tries to release, and the other is held, but then at some point the logic may not support the second release for the remaining lock because the cache map already "thinks" that it was freed?

I'll look through the code more to see if I see any particular logic errors, and I'll update this issue if I notice anything when modifying the fanout value.

If anyone has any other insights or suggestions, it'd be greatly appreciated.

There's a small handful of really large datasets that have been shared with our team, and this issue means that there's no clear programmatic way to run the downloads, and it's become a really manual process. I have to always cancel out of every download session manually with CTRL+C, and then I also need to double-check the directory/item counts, because there's no indication as to whether the download process froze on its final lock release after all downloads completed, or if it got stuck somewhere before the end.

This is happening with synapseclient version 4.3.0 in Ubuntu 20.04.1 on x86_64 with Linux 5.15.0-1060-gcp:

❯ cat ~/path/to/my/poetry.lock | grep synapse
name = "synapseclient"
    {file = "synapseclient-4.3.0-py3-none-any.whl", hash = "sha256:5d8107cfff4031a0a46d60a3c9a8120300190fa27df4983d883dc951d8bd885f"},
    {file = "synapseclient-4.3.0.tar.gz", hash = "sha256:a1149a64b3281669d42c69e210677a902478b8f6b302966d518473c7384f6387"},

but this also was similarly happening in v4.2.0 several weeks ago, and hasn't improved. Let me know if any other information is needed.

BryanFauble commented 2 weeks ago

@tommypkeane-gehc thanks for the report.

I am in the middle of working to rewrite the download algorithm. I've already rewritten the core logic, but now I need to focus on patching up syncFromSynapse.

Once I re-write this logic I'll be running through some download Benchmarking to compare before/after. I don't anticipate we are going to get a performance gain out of this, but, I do anticipate it'll eliminate or reduce issues like your reporting here.

These changes are not ready yet, but incase you are curious they are here: https://github.com/Sage-Bionetworks/synapsePythonClient/pull/1107 - Although syncFromSynapse is not production ready, feel free to checkout the code from this branch and use syn.get() for single file download. Feedback on the changes I make are always super important to me.

I'll dig more into it, but I'm not sure the cache fanout is the issue here. I suspect the issues are more so how the code was handling the synchronization of multiple-threads downloading data combined with an md5 calculation after file download. I'll think through the best mechanism to report back to the console logs when a file is in the "download" or "md5 calculation" phase, that way we are never in a situation of not knowing what is currently occurring.

tommypkeane-gehc commented 2 weeks ago

Thanks for clarifying and replying, @BryanFauble !

I'll dig more into it, but I'm not sure the cache fanout is the issue here. I suspect the issues are more so how the code was handling the synchronization of multiple-threads downloading data combined with an md5 calculation after file download. I'll think through the best mechanism to report back to the console logs when a file is in the "download" or "md5 calculation" phase, that way we are never in a situation of not knowing what is currently occurring.

I'm trying the fanout update and I'll add any new info I have to this Issue, if it actually makes any difference, but your explanation makes sense and I just haven't looked through enough of the code yet to see any clear issues. Thanks!

BryanFauble commented 1 week ago

@tommypkeane-gehc To give you an update. I have all the logic re-written and am getting everything reviewed, tested, and released. The release is not official yet, however, if this is something you're still running into problems with you may install the branch will all these changes via: pip install git+https://github.com/Sage-Bionetworks/synapsePythonClient@SYNPY-1476-remove-thread-executor-usage