Sage-Bionetworks / synapseDocs

Synapse documentation pages
https://docs.synapse.org

add a page on how Synapse file upload and download work #535

Open · brucehoff opened this issue 5 years ago

brucehoff commented 5 years ago

File transfer is a core feature of Synapse and there's a lot of engineering under the hood to make it efficient. It may be worth adding a page that discusses some of the features of file transfer:

- Synapse uses Amazon Web Services (AWS) Simple Storage Service (S3) by default for file storage.  When a file is transferred between a user (or, more specifically, a client such as a web browser) and S3, the file travels directly between the two.  The file does not, for example, pass through the Synapse server on its way to S3.  (A presigned-URL sketch of this direct transfer appears after this list.)

- When uploading files, Synapse uses S3's multipart upload, in which files are transferred in 5MB "chunks" and then reassembled in S3.  When failures occur, Synapse has the logic to retry just the failed chunks rather than re-uploading all of them.  Similarly, when downloading files, should a failure occur as the bytes are streamed from S3 to the client, Synapse's clients have the logic to resume at the point of failure rather than downloading the entire file from the start.  (See the multipart sketch after this list.)

- When uploading files, the Synapse Python client (and, by extension, the Synapse R client) uses multiple threads to upload file chunks in parallel; the Synapse web portal does not.  Note that multithreading only helps when the bottleneck is the throughput of a single thread rather than the network connectivity from the client to S3: if a single thread can saturate the network connection, parallelization would not be expected to accelerate file upload.  (See the threaded-upload sketch after this list.)

- Network bandwidth can vary greatly and is subject to many factors.  For example, network engineers can "shape" traffic, i.e., selectively throttle it between endpoints (https://en.wikipedia.org/wiki/Traffic_shaping).  In March 2017 we discovered that traffic between FHCRC and S3 was heavily throttled: Sage employees had better bandwidth to S3 from home than from work.  The traffic shaping was confirmed via discussions between FHCRC and AWS engineers (https://sagebionetworks.jira.com/browse/PLFM-4277).  Users who are concerned that file transfer to Synapse is "slow" should measure the transfer speed from their network to S3 using the standard AWS command line interface (CLI, https://aws.amazon.com/cli/); a timing sketch appears after this list.  If a large disparity in transfer speed is observed between the Synapse client and the AWS CLI, Synapse engineers will investigate.

- As stated above, S3 is the default file storage.  Synapse users are free to use other storage and simply "link" to their files' URLs in Synapse (see the linking sketch below).  As with S3, the transfer speed for such linked files depends on connectivity to the referenced storage system.
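
To make the first bullet concrete, here is a minimal sketch of a direct client-to-S3 transfer brokered by a presigned URL. This is an illustration rather than Synapse's actual implementation; the bucket, key, and file names are hypothetical, and it assumes boto3 and requests are installed with AWS credentials configured.

```python
# Minimal sketch of a direct client-to-S3 transfer via a presigned URL.
# Not Synapse's actual code; bucket/key/file names are hypothetical.
import boto3
import requests

s3 = boto3.client("s3")

# "Server" side: mint a short-lived URL that authorizes one PUT to S3.
url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "example-bucket", "Key": "example/data.bin"},
    ExpiresIn=3600,
)

# "Client" side: the bytes flow straight from the client to S3 over this URL;
# they never pass through the server that generated it.
with open("data.bin", "rb") as f:
    requests.put(url, data=f).raise_for_status()
```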
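
The chunked upload and resumable download in the second bullet can be sketched with S3's multipart-upload API and an HTTP Range request. Again, this is a simplified illustration rather than the Synapse client's code; the names are hypothetical and the three-attempt retry policy is arbitrary.

```python
# Minimal sketch of a chunked ("multipart") upload with per-part retry, plus a
# ranged re-request for resuming a download. Illustrative only.
import boto3

PART_SIZE = 5 * 1024 * 1024  # S3's minimum part size (except for the last part)

s3 = boto3.client("s3")
bucket, key = "example-bucket", "example/data.bin"

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts, part_number = [], 1
with open("data.bin", "rb") as f:
    while chunk := f.read(PART_SIZE):
        for attempt in range(3):  # retry just this part; earlier parts stand
            try:
                resp = s3.upload_part(
                    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                    PartNumber=part_number, Body=chunk,
                )
                parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})
                break
            except Exception:
                if attempt == 2:
                    raise
        part_number += 1

# S3 reassembles the parts into a single object.
s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)

# Resuming a download: re-request only the missing bytes (an HTTP Range
# request) instead of starting the whole file over.
bytes_already_received = 10 * 1024 * 1024  # bytes safely written before the failure
resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={bytes_already_received}-")
```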
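
The parallel upload in the third bullet amounts to fanning the parts of a multipart upload out over a thread pool. A minimal sketch, with a hypothetical bucket/key and an arbitrary worker count:

```python
# Minimal sketch of uploading multipart chunks from several threads, in the
# spirit of the Python client's behavior. boto3 clients are safe to share
# across threads.
from concurrent.futures import ThreadPoolExecutor
import boto3

PART_SIZE = 5 * 1024 * 1024
s3 = boto3.client("s3")
bucket, key = "example-bucket", "example/data.bin"

def read_chunks(path):
    with open(path, "rb") as f:
        while chunk := f.read(PART_SIZE):
            yield chunk

mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)

def upload_one(numbered_chunk):
    part_number, chunk = numbered_chunk
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                          PartNumber=part_number, Body=chunk)
    return {"PartNumber": part_number, "ETag": resp["ETag"]}

# Eight workers push parts concurrently. If one thread already saturates the
# network link, adding workers won't speed anything up.
with ThreadPoolExecutor(max_workers=8) as pool:
    parts = list(pool.map(upload_one, enumerate(read_chunks("data.bin"), start=1)))

s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})
```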
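
For the speed measurement suggested in the fourth bullet, timing a plain SDK upload gives roughly the same baseline as timing the AWS CLI; compare that number against what the Synapse client achieves. The bucket and test file are hypothetical:

```python
# Minimal sketch for measuring raw client-to-S3 throughput (roughly what
# timing `aws s3 cp data.bin s3://example-bucket/speedtest/` would tell you).
import os
import time
import boto3

s3 = boto3.client("s3")
path, bucket = "data.bin", "example-bucket"

start = time.monotonic()
s3.upload_file(path, bucket, "speedtest/data.bin")
elapsed = time.monotonic() - start

mbits = os.path.getsize(path) * 8 / 1e6
print(f"uploaded {mbits:.0f} Mbit in {elapsed:.1f}s -> {mbits / elapsed:.1f} Mbit/s")
```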
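
And for the last bullet, linking an externally hosted file with the Synapse Python client looks roughly like this (the URL and parent ID are hypothetical):

```python
# Minimal sketch of linking an externally hosted file with the Synapse Python
# client. synapseStore=False records the link without copying the bytes into
# Synapse storage.
import synapseclient
from synapseclient import File

syn = synapseclient.login()  # assumes cached credentials / config file

linked = syn.store(File("https://example.org/datasets/big_file.vcf.gz",
                        parent="syn12345678",  # hypothetical project/folder
                        synapseStore=False))
# Downloads of this entity fetch bytes from example.org, so transfer speed
# depends on connectivity to that server rather than to S3.
```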
brucehoff commented 5 years ago

It would be useful to have a page with broad/general guidance on collaborating on large file transfers. If someone at Sage wants to move 100TB to (or from) Synapse, what do they do? How do they find out whether their collaborator has enough bandwidth to transfer the data? How do they figure out which tool is best to use? How long will it take, and how much will it cost (if anything)? Which steps are done by the Sage scientist, and which by the collaborator? (A rough time estimator is sketched below.)
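
As a starting point for the "how long will it take" question, a back-of-the-envelope estimate from data size and link speed (the efficiency factor is an assumption, not a measurement):

```python
# Back-of-the-envelope estimate for the "how long will it take" question.
# The 70% link-efficiency factor is an assumption, not a measurement.
TB = 1e12  # decimal terabyte, in bytes

def days_to_transfer(size_bytes, link_mbit_per_s, efficiency=0.7):
    seconds = size_bytes * 8 / (link_mbit_per_s * 1e6 * efficiency)
    return seconds / 86400

for mbps in (100, 1000, 10000):
    print(f"{mbps:>5} Mbit/s: {days_to_transfer(100 * TB, mbps):6.1f} days")
# ~132 days at 100 Mbit/s, ~13 days at 1 Gbit/s, ~1.3 days at 10 Gbit/s
```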

kdaily commented 5 years ago

Good suggestion @brucehoff. @ychae wrote a nice doc describing one way we've done this:

https://github.com/Sage-Bionetworks/AD-DCC/blob/master/large_data_uploads.md

kdaily commented 5 years ago

Also, to partially answer the question (and to ask whether it's a tool you've used elsewhere): I've used this site to test AWS connections before (linked for S3, but they have lots of other tests available):

http://cloudharmony.com/speedtest-for-aws:s3-us-east-1-and-aws:s3-us-east-2