Closed ErinWeisbart closed 1 year ago
If this happens because you have a lot of jobs all finishing at the same time (but not finishing so rapidly that there is an ever-increasing backlog of files failing to upload to S3), an alternative workaround is to increase `SECONDS_TO_START` in `config.py` so that there is more separation between jobs finishing. We can also handle the second scenario (but not the first) in DCP by changing the `time.sleep(30)` after a move error to a random int.
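The random-int change described above could look something like the following. This is a minimal sketch, not DCP's actual code; the function name and surrounding retry structure are hypothetical:

```python
import random
import time

def pause_before_retry(max_seconds=30):
    """Hypothetical pause used after a failed S3 move.

    A fixed time.sleep(30) makes every worker retry in lockstep, so the
    request-rate spike that triggered SlowDown simply repeats. Sleeping a
    random interval instead spreads the retries out across workers.
    """
    delay = random.randint(1, max_seconds)  # random int instead of a fixed 30
    time.sleep(delay)
    return delay
```

The key point is only that each worker picks a different delay, desynchronizing the retries.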
This is troubleshooting information for an S3 limitation and not a bug in DCP. I have added this to the documentation, so this will be closed by #135.
In CloudWatch we see:

```
== ERR move failed: local_output/PLATE/SegmentationCheck_Image.csv to s3://BUCKET/projects/PROJECT/BATCH/images_segmentation/PLATE/SegmentationCheck_Image.csv An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.
```
We don't see the error immediately; instead it becomes more and more prevalent in our logs over time. We believe this is happening because we have too many machines finishing too many jobs very quickly (e.g. we had 200 machines, each running 4 copies of DCP and finishing jobs in ~30 seconds).
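For context, `SlowDown` is S3's standard throttling response, and the usual client-side remedy is exponential backoff with jitter around the upload call. A generic sketch, not DCP code — the `upload` callable and the limits here are illustrative assumptions:

```python
import random
import time

def upload_with_backoff(upload, max_attempts=8, base=1.0, cap=60.0):
    """Call `upload()` and retry on a throttling error, sleeping an
    exponentially growing, randomly jittered interval between attempts."""
    for attempt in range(max_attempts):
        try:
            return upload()
        except RuntimeError:  # stand-in for an S3 SlowDown/throttling error
            if attempt == max_attempts - 1:
                raise
            # "full jitter": sleep a random amount up to the backoff ceiling
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Because each worker sleeps a different random amount, retries spread out over time instead of all machines hammering S3 again in the same second.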
Workaround is to: