Closed valentynbez closed 1 year ago
Hey @valentynbez,
Thanks for opening up this issue. Could you maybe help me understand what you mean exactly? The sequences are being fetched by fasterq-dump behind the scenes so we don't have much control over how that's done. It's true that they are saved as fastq in the TMPDIR but right after we fetch them, we do compress them to fastq.gz. Everything else from the TMPDIR gets cleaned up at the end. The only issue I'm noticing now is that after we gzip the fastq sequences, the originals still stay in the TMPDIR, potentially occupying a lot of space, as you pointed out. I do think, however, we can simply remove them after zipping is complete as they won't be needed for anything else anymore. Is that what you meant? If not, could you please provide some more details?
Thanks!
If I understand correctly, the compression step is applied after sequences are downloaded, during Artifact creation. pigz could be applied right after fasterq-dump in this function:
https://github.com/bokulich-lab/q2-fondue/blob/704bd98b05ccc23b54157ab7115a3a491fe65d81/q2_fondue/sequences.py#L43-L68
I assume this would compress the downloaded sequencing data right away, thus reducing space requirements for big studies during a q2-fondue run.
I only noticed the leftover sequences in $TMPDIR; wiping them out would be great as well :)
No, the compression step happens right after download and is performed by the rewrite_fastq function, which is called here:
https://github.com/bokulich-lab/q2-fondue/blob/704bd98b05ccc23b54157ab7115a3a491fe65d81/q2_fondue/sequences.py#L234-L249
There is no need then for additional compression. We would, however, like to add a step to remove the original fastq sequences once the compression is finished, to release that space immediately. Thanks for pointing us in this direction :)
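For illustration, the cleanup step described above could look roughly like the sketch below. This is not the actual rewrite_fastq code from q2-fondue; the function name compress_and_remove is hypothetical, and the sketch assumes a plain gzip of a single fastq file followed by deletion of the original.

```python
import gzip
import os
import shutil


def compress_and_remove(fastq_path: str) -> str:
    """Gzip `fastq_path` to `fastq_path + '.gz'`, then delete the original.

    Hypothetical helper, not part of q2-fondue.
    """
    gz_path = fastq_path + ".gz"
    # Stream the file in chunks so large fastq files are never fully in memory.
    with open(fastq_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    # Remove the uncompressed original only after the compressed copy has
    # been written and closed, so a failure above never loses data.
    os.remove(fastq_path)
    return gz_path
```

Deleting the original only after the gzip handle is closed means a failed compression leaves the raw fastq in place.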
Yes, I misunderstood the function of q2-CasavaOneEightSingleLanePerSampleDirFmt and tmpdir.
Current behavior: Sequences are downloaded to TMPDIR and saved as raw .fastq files.
Proposed improvement: Because pigz is available in the main qiime2 environment, sequences can be efficiently archived after download with pigz --fast, saving up to 75% of hard-drive space, which comes in handy for big datasets. AFAIK unarchived sequences won't be used downstream.
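A rough sketch of what this proposal could look like in Python, assuming pigz is on the PATH (with a fallback to the standard-library gzip module if it is not). The function name compress_fastq and the threads parameter are hypothetical and not part of q2-fondue.

```python
import gzip
import os
import shutil
import subprocess


def compress_fastq(fastq_path: str, threads: int = 4) -> str:
    """Compress a fastq file in place and return the path of the .gz file.

    Hypothetical helper, not part of q2-fondue.
    """
    gz_path = fastq_path + ".gz"
    if shutil.which("pigz"):
        # `pigz --fast` uses the lowest compression level; like gzip, pigz
        # replaces the input file with `<name>.gz` when it succeeds.
        subprocess.run(
            ["pigz", "--fast", "-p", str(threads), fastq_path],
            check=True,
        )
    else:
        # Fallback: single-threaded gzip at the fastest compression level.
        with open(fastq_path, "rb") as src, \
                gzip.open(gz_path, "wb", compresslevel=1) as dst:
            shutil.copyfileobj(src, dst)
        os.remove(fastq_path)
    return gz_path
```

Either way the uncompressed file is gone once the call returns, so only the .fastq.gz copy occupies space in TMPDIR.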