datacarpentry / organization-genomics

Project Organization and Management for Genomics
https://datacarpentry.org/organization-genomics

parallel-fastq-dump #104

Closed ErinBecker closed 5 years ago

ErinBecker commented 5 years ago

@mgalland opened issue #52 in the genomics-workshop repo, which is about curricular content included in this lesson. I'm moving this issue over to this repo, as the Maintainers here will be more qualified to understand and act on this suggestion.

hoytpr commented 5 years ago

@ErinBecker this looks like a great tool, but maybe the HPC lesson is a better spot for it. There are a few tools for faster data transfer, and it's a great topic, but for this lesson we are intentionally using a small dataset, IMHO.

hoytpr commented 5 years ago

I believe we can close this, although it seems appropriate to re-write the lesson to make use of the new tools. @ErinBecker According to NCBI (Ben Busby?) and on https://github.com/ncbi/sra-tools, "With release 2.9.1 of sra-tools we have finally made available the tool fasterq-dump, a replacement for the much older fastq-dump tool. As its name implies, it runs faster, and is better suited for large-scale conversion of SRA objects into FASTQ files that are common on sites with enough disk space for temporary files. fasterq-dump is multi-threaded and performs bulk joins in a way that improves performance as compared to fastq-dump, which performs joins on a per-record basis (and is single-threaded).

fastq-dump is still supported as it handles more corner cases than fasterq-dump, but it is likely to be deprecated in the future.

You can get more information about fasterq-dump in our Wiki at https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump."
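For readers who want to see what a multi-threaded fasterq-dump call might look like, here is a minimal sketch in Python. It only builds the command line and does not run anything. The accession `SRRXXXXXXX`, the output directory name, and the helper `fasterq_dump_command` are hypothetical placeholders, and the flags used (`--outdir`, `--threads`, `--temp`) should be checked against `fasterq-dump --help` for the installed sra-tools version.

```python
import shlex


def fasterq_dump_command(accession, outdir="fastq", threads=4, temp_dir=None):
    """Build (but do not run) an argument list for fasterq-dump.

    `accession` is an SRA run accession; the value used below is a placeholder.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    cmd = ["fasterq-dump", accession, "--outdir", outdir, "--threads", str(threads)]
    if temp_dir is not None:
        # fasterq-dump writes temporary files, so a scratch directory
        # with enough free space can be passed explicitly.
        cmd += ["--temp", temp_dir]
    return cmd


if __name__ == "__main__":
    cmd = fasterq_dump_command("SRRXXXXXXX", threads=6)
    # Print the shell-quoted command for inspection.
    print(shlex.join(cmd))
    # To actually run it (requires sra-tools on PATH and a real accession):
    # import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to inspect or log the command before committing to a large download.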

ErinBecker commented 5 years ago

Thanks for the feedback @hoytpr. I'm pinging @ACharbonneau to see if she wants to try to incorporate this into the Cloud lesson.

JasonJWilliamsNY commented 5 years ago

Arizona BugBBQ - We don't think any importing from SRA is needed for this workshop. Learners should be given skills that will make this easier on its own. There are tools that will pull from SRA without using NCBI tools, etc.

hoytpr commented 5 years ago

It's true that there are tons of ways to download, and life science students probably know a couple of ways. Providing the data with a link for interested learners is a great way to save time for more important items.

hoytpr commented 5 years ago

@JasonJWilliamsNY and @ErinBecker Because curl and wget worked very well in the Arizona BugBBQ, and because fasterq-dump, the official replacement for fastq-dump, is already multi-threaded (even if that is not needed here), I'm going to close this issue.
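
As an illustration of the link-based download approach discussed in this thread, the sketch below does roughly what `curl -O` or `wget` would do, using only the Python standard library. The helper name `download` and any URL passed to it are assumptions for illustration; the lesson's actual data links are not reproduced here.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream the resource at `url` into the file `dest` and return its path."""
    dest = Path(dest)
    # Create the target directory if it does not exist yet.
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so large FASTQ files are not held in memory.
        shutil.copyfileobj(response, out)
    return dest
```

A call such as `download(url, "data/sample.fastq.gz")`, with `url` set to whatever link the lesson provides, would save the file under `data/`.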