DistributedScience / Distributed-CellProfiler

Run encapsulated docker containers with CellProfiler in the Amazon Web Services infrastructure.
https://distributedscience.github.io/Distributed-CellProfiler/

== ERR move failed #123

Closed ErinWeisbart closed 1 year ago

ErinWeisbart commented 3 years ago

In CloudWatch we see:

== ERR move failed: local_output/PLATE/SegmentationCheck_Image.csv to s3://BUCKET/projects/PROJECT/BATCH/images_segmentation/PLATE/SegmentationCheck_Image.csv An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.

The error does not appear immediately; it becomes more and more prevalent in our logs over time. We believe this happens because too many machines are finishing too many jobs very quickly (e.g. we had 200 machines, each running 4 copies of DCP and finishing jobs in ~30 seconds).
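For a rough sense of scale, the numbers in that example can be turned into a job completion rate. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and the number of files each job uploads is not stated in this thread, so only job completions are counted.

```python
# Figures taken from the example above; variable names are illustrative only.
machines = 200
copies_per_machine = 4
seconds_per_job = 30

workers = machines * copies_per_machine        # concurrent DCP copies
jobs_per_second = workers / seconds_per_job    # average completion rate

print(f"{workers} workers, ~{jobs_per_second:.1f} job completions per second")
```

This works out to 800 workers and roughly 26.7 job completions per second. Each completion uploads at least one file, so the PutObject rate is at least that high.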

Workaround is to:

ErinWeisbart commented 3 years ago

If this happens because a lot of jobs all finish at the same time (rather than finishing so rapidly that there is an ever-increasing backlog of files failing to upload to S3), an alternative workaround is:

ErinWeisbart commented 3 years ago

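We can also handle the second scenario (but not the first) in DCP by changing the fixed time.sleep(30) after a move error to a sleep of a random integer number of seconds, so that retries from different workers are spread out instead of all landing at the same moment.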
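A minimal sketch of that idea follows. It is not DCP's actual code: the function name move_with_jitter, its parameters, and the 10 to 60 second range are all assumptions chosen for illustration, and move stands in for whatever callable performs the upload.

```python
import random
import time


def move_with_jitter(move, max_attempts=5, min_sleep=10, max_sleep=60,
                     sleep=time.sleep):
    """Call move() until it succeeds, sleeping a random whole number of
    seconds after each failure. All names and defaults are hypothetical."""
    for attempt in range(1, max_attempts + 1):
        try:
            return move()
        except Exception as err:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            # Random delay instead of a fixed time.sleep(30), so workers
            # that failed together do not all retry together.
            delay = random.randint(min_sleep, max_sleep)
            print(f"== ERR move failed: {err}; retrying in {delay} s")
            sleep(delay)
```

The sleep parameter is there only so the delay can be replaced in tests. As noted above, randomizing the delay helps when many jobs finish at once, but not when uploads fail faster than they can be retried.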

ErinWeisbart commented 1 year ago

This is troubleshooting information for an S3 limitation and not a bug in DCP. I have added it to the documentation, so this issue will be closed by #135.