MIT-LCP / physionet-build

The new PhysioNet platform.
https://physionet.org/
BSD 3-Clause "New" or "Revised" License

S3 sync performance improvements #2203

Closed: bemoody closed this issue 2 months ago

bemoody commented 3 months ago

Uploading files to S3 is coming along, but it's slow. The server just spent about 2 days uploading one database (143 GB, 57k files).

There was also a problem recently where a project's zip file was missing, so the server retried five times (re-uploading the entire project each time).

Here are some things we could do to improve the S3 upload logic:

  1. Upload files in order and track progress, so that when the upload task is interrupted or retried, it can resume without restarting the whole process (see the first sketch below).

  2. Check checksums and skip files that haven't changed when re-uploading a project (see the second sketch below).

  3. Detect files that already exist in S3 (in a previous project version) and do an S3-to-S3 copy instead of re-uploading (also covered in the second sketch).
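
For item 1, here is a minimal sketch of what resumable, ordered uploading could look like, assuming boto3 and a simple state file that records the last key fully uploaded. The function name, arguments, and state-file approach are all illustrative, not a proposal for the exact implementation:

```python
import os
import boto3

s3 = boto3.client("s3")


def upload_project(local_root, bucket, prefix, state_path):
    """Upload all files under local_root in sorted order, recording
    progress in state_path so an interrupted run can be resumed."""
    # Build a deterministic, sorted list of relative paths.
    paths = []
    for dirpath, _dirnames, filenames in os.walk(local_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            paths.append(os.path.relpath(full, local_root))
    paths.sort()

    # Resume point: the last key that was fully uploaded, if any.
    last_done = None
    if os.path.exists(state_path):
        with open(state_path) as f:
            last_done = f.read().strip() or None

    for rel in paths:
        key = prefix + rel
        if last_done is not None and key <= last_done:
            continue  # already uploaded in a previous run
        s3.upload_file(os.path.join(local_root, rel), bucket, key)
        # Record progress only after the upload succeeds.
        with open(state_path, "w") as f:
            f.write(key)
```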

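For items 2 and 3, a sketch of how the per-file decision might work, again assuming boto3. Note this relies on the S3 ETag matching the local MD5, which only holds for single-part, non-KMS uploads, so treat it as an illustration rather than a drop-in check; the bucket/key arguments are placeholders:

```python
import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")


def local_md5(path):
    """Hex MD5 of a local file, streamed in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()


def etag_of(bucket, key):
    """Return the object's ETag (without quotes), or None if it doesn't exist."""
    try:
        resp = s3.head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return None
        raise
    return resp["ETag"].strip('"')


def sync_file(path, bucket, key, previous_key=None):
    """Skip, server-side copy, or upload a single file."""
    md5 = local_md5(path)

    # 2. Skip a file that is already present and unchanged.
    #    (ETag == MD5 only for single-part, non-KMS uploads.)
    if etag_of(bucket, key) == md5:
        return "skipped"

    # 3. If the same content exists under a previous project version,
    #    do an S3-to-S3 copy instead of re-uploading the bytes.
    if previous_key is not None and etag_of(bucket, previous_key) == md5:
        s3.copy({"Bucket": bucket, "Key": previous_key}, bucket, key)
        return "copied"

    s3.upload_file(path, bucket, key)
    return "uploaded"
```
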
bemoody commented 2 months ago

(Keep in mind, all these problems apply equally to GCP.)

I'm somewhat inclined to throw away the Python upload code and just use rclone or awscli.
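
If we went that route, the Python side could shrink to roughly this (a sketch only, assuming awscli is installed on the server; the function name and paths are placeholders, and error handling/credentials still need thought). `aws s3 sync` already handles retries and skips files whose size and modification time are unchanged:

```python
import subprocess


def sync_to_s3(local_root, bucket, prefix):
    """Delegate the upload to `aws s3 sync`, which only transfers
    files that differ in size or modification time."""
    subprocess.run(
        ["aws", "s3", "sync", local_root, f"s3://{bucket}/{prefix}"],
        check=True,
    )
```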

bemoody commented 2 months ago

Duplicate of issue #1903