Closed klimaj closed 2 months ago
PyRosetta unit tests have successfully passed, and this PR is ready for review
@klimaj just noticed that this is still not merged, - just a reminder in case it was forgotten. (no rush if you planning to add more code or something)
@lyskov Thanks for your review! Yes, this PR is ready to be merged (I was waiting as I am not an authorized user to merge, but please go ahead and merge if possible).
The aim of this PR is to fix two issues in PyRosettaCluster:
client.scatter(hash=False)
option, so that random keys are assigned to data rather than hashed from the scattered data. If workers are processing two identical tasks on the same machine, then this can lead to aKeyError
, sohash=False
prevents this. This should be the case anyway since user-provided PyRosetta protocols may be deterministic or stochastic (and we're already submitting them to the client as impure functions), and repeated tasks may have their own destiny (depending on choreography of the simulation) and be processed at different times. Overall, this update doesn't affect user-facing functionality, but it prevents errors from occurring if two tasks happen to stem from identical data, so it's considered a bug fix.max_delay_time
. A dask scheduler may get overwhelmed with large task graphs, many workers, and quickly executing tasks. Tasks run on workers asynchronously from the dask scheduler checking up on them, so a task may complete before the scheduler knows it is being processed, leading to a race condition. To address this scenario, users now have a safety throttle to delay returning results from a worker to the dask scheduler, allowing the scheduler to orchestrate the simulation properly. In practice, this option only takes effect in cases when the user returns a result immediately (e.g.,def protocol(packed_pose, **kwargs): return None
). The worker will delay the result for a default of 3 seconds (at least 1 second is recommended). In cases where user-provided PyRosetta protocols are running computation that takes longer than 3 seconds, then this update has no effect. Users can set this option to 0 seconds for existing functionality, but may encounter occasionalCommClosedError
s depending on a number of factors (e.g., existing network traffic and latency, task graph size, number of workers, compute resource utilization, etc.).