DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

Thread-safe file import on S3 #2707

Open · glennhickey opened 5 years ago

glennhickey commented 5 years ago

I often find that importing large files at the beginning of my workflow is a large time bottleneck. For instance, if I want to map a pair of fastq files from the EBI FTP site, it can take several hours to import each one using toil.importFile(). Same issue for, say, the phased per-chromosome 1000 Genomes VCFs.

This time could be cut in half if I could import the two fastq files at once. That actually works fine on a local job store, but not on an S3 job store. If I recall correctly, it's because the boto session object is not thread-safe. But I don't think it would be a major change to use multiple sessions (or maybe there's a thread-safe API somewhere?). It would be a big speedup for importing large files.

Coupling this with some kind of asynchronous / batch import interface would be even better.
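As a minimal sketch of what such a batch interface could look like, assuming importFile were thread-safe (which, per this issue, it currently is not on S3) — the helper name `import_files_parallel` is hypothetical, not part of Toil's API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical batch-import helper: fan toil.importFile() calls out
# across worker threads. This only works if importFile is thread-safe,
# which is exactly what this issue is asking for on the S3 job store.
def import_files_parallel(toil, urls, max_workers=2):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(toil.importFile, url) for url in urls]
        return [f.result() for f in futures]

# e.g. fastq_ids = import_files_parallel(toil, [fastq1_url, fastq2_url])
```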

Issue is synchronized with Jira Story TOIL-380

adamnovak commented 5 years ago

Maybe we can solve this, #2704, and #2706 all at the same time by moving all our S3 uploads/downloads to boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#client

It has its own logic for multithreaded uploads and downloads, and for uploads and downloads to and from streams.

Though to really solve this issue, we probably have to make importFile asynchronous. There's no way to make S3 itself download from a URL, and I don't think there's an easy way to do byte range reads from a remote URL to support a boto3-managed multipart upload.
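For reference, a minimal sketch of the boto3 managed transfers mentioned above, which split large objects into parts and move them with their own internal thread pool (the bucket and key names here are made up for illustration):

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Managed transfers handle multipart parallelism internally.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MiB
    max_concurrency=10,                    # internal worker threads
)
s3 = boto3.client("s3")

# upload_fileobj/download_fileobj accept any file-like object,
# which is what makes streaming uploads and downloads possible.
with open("reads_1.fastq.gz", "rb") as f:
    s3.upload_fileobj(f, "my-bucket", "reads_1.fastq.gz", Config=config)
```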

glennhickey commented 5 years ago

+1 for boto3 and for an asynchronous importFile!

I was testing this out with a quick (and bad) asynchronous wrapper for importFile

https://github.com/vgteam/toil-vg/blob/master/src/toil_vg/vg_common.py#L956-L1027

but it never worked because the current implementation of importFile is not thread-safe.


unito-bot commented 3 years ago

➤ Adam Novak commented:

Lon thinks each operation might need its own session to really be thread-safe.

DailyDreaming commented 3 years ago

@w-gao @adamnovak Reference is here: https://boto3.amazonaws.com/v1/documentation/api/1.14.31/guide/session.html#multithreading-or-multiprocessing-with-sessions
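Per those linked docs, a single boto3 Session should not be shared across threads, but each thread can create its own. A minimal sketch of that pattern (the helper names are hypothetical):

```python
import threading
import boto3

# One boto3 Session (and client) per thread, as the boto3
# multithreading docs recommend; a Session's underlying state
# is not safe to share across threads.
_local = threading.local()

def thread_local_s3():
    if not hasattr(_local, "client"):
        _local.client = boto3.session.Session().client("s3")
    return _local.client

def download(bucket, key, dest_path):
    # Safe to call from multiple threads: each thread uses its own client.
    thread_local_s3().download_file(bucket, key, dest_path)
```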

unito-bot commented 3 years ago

➤ Adam Novak commented:

Lon says he is fixing this.