Closed ErinWeisbart closed 1 year ago
If this happens because you have a lot of jobs all finishing at the same time (but not finishing so rapidly that there is an ever-increasing backlog of files failing to upload to S3), an alternative workaround is to increase `SECONDS_TO_START` in `config.py` so that there is more separation between jobs finishing. We can also handle the second scenario (but not the first) in DCP by changing the `time.sleep(30)` after a move error to a random int.
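The random-int change described above could look something like the following. This is a minimal sketch, not DCP's actual code; the function name and surrounding retry structure are hypothetical:

```python
import random
import time

def pause_before_retry(max_seconds=30):
    """Hypothetical pause used after a failed S3 move.

    A fixed time.sleep(30) makes every worker retry in lockstep, so the
    request-rate spike that triggered SlowDown simply repeats. Sleeping a
    random interval instead spreads the retries out across workers.
    """
    delay = random.randint(1, max_seconds)  # random int instead of a fixed 30
    time.sleep(delay)
    return delay
```

The key point is only that each worker picks a different delay, desynchronizing the retries.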
This is troubleshooting information for an S3 limitation and not a bug in DCP. I have added this to the documentation, so this will be closed by #135.
In CloudWatch we see:

```
== ERR move failed: local_output/PLATE/SegmentationCheck_Image.csv to s3://BUCKET/projects/PROJECT/BATCH/images_segmentation/PLATE/SegmentationCheck_Image.csv An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.
```
We don't see the error immediately; instead it becomes more and more prevalent in our logs over time. We believe this is happening because we have too many machines finishing too many jobs very quickly (e.g. we had 200 machines, each running 4 copies of DCP and finishing jobs in ~30 seconds).
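For context, `SlowDown` is S3's standard throttling response, and the usual client-side remedy is exponential backoff with jitter around the upload call. A generic sketch, not DCP code — the `upload` callable and the limits here are illustrative assumptions:

```python
import random
import time

def upload_with_backoff(upload, max_attempts=8, base=1.0, cap=60.0):
    """Call `upload()` and retry on a throttling error, sleeping an
    exponentially growing, randomly jittered interval between attempts."""
    for attempt in range(max_attempts):
        try:
            return upload()
        except RuntimeError:  # stand-in for an S3 SlowDown/throttling error
            if attempt == max_attempts - 1:
                raise
            # "full jitter": sleep a random amount up to the backoff ceiling
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Because each worker sleeps a different random amount, retries spread out over time instead of all machines hammering S3 again in the same second.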
Workaround is to: