Open glennhickey opened 5 years ago
Maybe we can solve this, #2704, and #2706 all at the same time by moving all our S3 uploads/downloads to boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#client
It has its own logic for multithreaded uploads and downloads, and for uploads and downloads to and from streams.
Though to really solve this issue, we probably have to make importFile asynchronous. There's no way to make S3 itself download from a URL, and I don't think there's an easy way to do byte range reads from a remote URL to support a boto3-managed multipart upload.
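To illustrate the byte-range angle: an S3 multipart upload consumes fixed-size parts, so driving one from a remote URL would mean issuing HTTP `Range` requests for exactly those spans. A pure-Python sketch of the part arithmetic (the helper name is made up):

```python
def part_ranges(total_size, part_size):
    """Split total_size bytes into inclusive (start, end) byte ranges,
    one per multipart-upload part / HTTP Range request."""
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + part_size, total_size) - 1  # last part may be short
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Each `(start, end)` pair would map to a `Range: bytes=start-end` request against the remote URL on one side and one uploaded part on the S3 side — assuming the remote server supports range requests at all, which is the doubtful part.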
+1 for boto3 and for an asynchronous importFile!
I was testing this out with a quick (and bad) asynchronous wrapper for importFile
https://github.com/vgteam/toil-vg/blob/master/src/toil_vg/vg_common.py#L956-L1027
but it never worked because the current implementation of importFile is not thread-safe.
➤ Adam Novak commented:
Lon thinks each operation might need its own session to really be thread safe.
@w-gao @adamnovak Reference is here: https://boto3.amazonaws.com/v1/documentation/api/1.14.31/guide/session.html#multithreading-or-multiprocessing-with-sessions
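The pattern the linked guide recommends — each thread builds its own Session rather than sharing one — can be captured with a small thread-local cache. A sketch (the class name is hypothetical; in practice you'd pass `boto3.session.Session` as the factory):

```python
import threading

class PerThreadSession:
    """Cache one object per thread, e.g. PerThreadSession(boto3.session.Session).

    Session objects are not safe to share across threads, so each worker
    thread lazily builds its own and reuses it for subsequent calls.
    """
    def __init__(self, factory):
        self._factory = factory
        self._local = threading.local()

    def get(self):
        if not hasattr(self._local, "value"):
            self._local.value = self._factory()  # first call in this thread
        return self._local.value
```

Clients and resources created from a per-thread Session then inherit its isolation, which is what the boto3 docs say you need for multithreaded use.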
➤ Adam Novak commented:
Lon says he is fixing this.
I often find that importing large files at the beginning of my workflow is a major time bottleneck. For instance, if I want to map a pair of fastq files from the EBI FTP site, it can take several hours to import each one using `toil.importFile()`. Same issue for, say, the phased chromosome 1000 Genomes VCFs. This time could be cut in half if I could import the two fastq files at once. This actually works fine on a local job store, but does not work on an S3 job store. If I recall correctly, it's because the boto session object is not thread-safe. But I don't think it's a major change to use multiple sessions (or maybe there's a thread-safe API somewhere?). It would be a big speedup for importing large files.
Coupling this with some kind of asynchronous / batch import interface would be even better.
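Once importFile itself is thread-safe, a batch interface could be as thin as a thread-pool wrapper. A sketch (the function name is hypothetical, and `toil.importFile` is assumed to be safe to call concurrently — which, per the discussion above, it currently is not):

```python
from concurrent.futures import ThreadPoolExecutor

def import_files(toil, urls, max_workers=2):
    """Import several URLs concurrently, returning file IDs in input order.

    Only correct once importFile is thread-safe; with the current
    implementation these calls would race on the shared boto session.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(toil.importFile, url) for url in urls]
        return [f.result() for f in futures]
```

With `max_workers=2` this is exactly the "import the two fastq files at once" case, and raising it generalizes to larger batches.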
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-380