bokulich-lab / q2-fondue

Functions for reproducibly Obtaining and Normalizing Data re-Used from Elsewhere
BSD 3-Clause "New" or "Revised" License

Delete raw sequences after compression #151

Closed valentynbez closed 1 year ago

valentynbez commented 1 year ago

Current behavior: Sequences in TMPDIR are downloaded and saved as raw .fastq.

Proposed improvement: Since pigz is available in the main qiime2 environment, sequences can be compressed right after download with pigz --fast, saving up to 75% of hard-drive space, which comes in handy for big datasets. AFAIK the uncompressed sequences won't be used downstream.

misialq commented 1 year ago

Hey @valentynbez,

Thanks for opening up this issue. Could you maybe help me understand what you mean exactly? The sequences are being fetched by fasterq-dump behind the scenes so we don't have much control over how that's done. It's true that they are saved as fastq in the TMPDIR but right after we fetch them, we do compress them to fastq.gz. Everything else from the TMPDIR gets cleaned up at the end. The only issue I'm noticing now is that after we gzip the fastq sequences, the originals still stay in the TMPDIR, potentially occupying a lot of space, as you pointed out. I do think, however, we can simply remove them after zipping is complete as they won't be needed for anything else anymore. Is that what you meant? If not, could you please provide some more details?

Thanks!

valentynbez commented 1 year ago

If I understand correctly, the compression step is applied after sequences are downloaded during Artifact creation. pigz could be applied right after fasterq-dump in this function: https://github.com/bokulich-lab/q2-fondue/blob/704bd98b05ccc23b54157ab7115a3a491fe65d81/q2_fondue/sequences.py#L43-L68

I assume this would compress the downloaded sequencing data right away, thus reducing space requirements for big studies during a q2-fondue run.

I only noticed the leftover sequences in $TMPDIR; wiping them out would be great as well :)
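For illustration, the proposal above (running pigz straight after the download) could look roughly like the sketch below. This is not q2-fondue code: `pigz_command` and `compress_with_pigz` are hypothetical helper names, and the thread count is an arbitrary assumption. It relies on pigz being on the PATH and on pigz's default behaviour of replacing each input file with its `.gz` counterpart (no `--keep`), so the raw files would not linger.

```python
import shutil
import subprocess


def pigz_command(fastq_paths, threads=4):
    """Build the pigz call (hypothetical helper, not part of q2-fondue).

    --fast selects the quickest compression level and -p sets the
    number of compression threads.
    """
    return ["pigz", "--fast", "-p", str(threads), *map(str, fastq_paths)]


def compress_with_pigz(fastq_paths, threads=4):
    """Compress the given files in place with pigz.

    Without --keep, pigz removes each input after writing <name>.gz.
    """
    if shutil.which("pigz") is None:
        raise RuntimeError("pigz not found on PATH")
    # check=True raises CalledProcessError if pigz exits with an error.
    subprocess.run(pigz_command(fastq_paths, threads), check=True)
```

Keeping the command construction separate from the call makes the flags easy to inspect without actually running pigz.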

misialq commented 1 year ago

No, the compression step happens right after download and is performed by the rewrite_fastq function, which is called here: https://github.com/bokulich-lab/q2-fondue/blob/704bd98b05ccc23b54157ab7115a3a491fe65d81/q2_fondue/sequences.py#L234-L249

There is no need then for additional compression. We would, however, like to add a step to remove the original fastq sequences once the compression is finished, to release that space immediately. Thanks for pointing us in this direction :)
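As a rough illustration of the cleanup step described above, here is a minimal sketch that gzips a FASTQ file and deletes the original only once compression has succeeded. It is not the actual rewrite_fastq implementation; `compress_and_remove` is a hypothetical name and the default compression level is an assumption.

```python
import gzip
import shutil
from pathlib import Path


def compress_and_remove(fastq_path, compresslevel=6):
    """Gzip a FASTQ file, then delete the uncompressed original.

    Hypothetical helper for illustration; not part of q2-fondue.
    """
    src = Path(fastq_path)
    dst = src.with_name(src.name + ".gz")
    try:
        # Stream the data so large files are never held in memory.
        with open(src, "rb") as fin, \
                gzip.open(dst, "wb", compresslevel=compresslevel) as fout:
            shutil.copyfileobj(fin, fout)
    except Exception:
        # Drop the partial archive and keep the original untouched.
        dst.unlink(missing_ok=True)
        raise
    # Only reached if compression finished; frees the space right away.
    src.unlink()
    return dst
```

Removing the original inside the same function, rather than during the final TMPDIR cleanup, means the space is released per file instead of at the end of the run.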

valentynbez commented 1 year ago

Yes, I misunderstood the function of q2-CasavaOneEightSingleLanePerSampleDirFmt and tmpdir.