TheJacksonLaboratory / splicing-pipelines-nf

Repository for the Anczukow-Lab splicing pipeline

Research: can we save only compressed BAMs from STAR into results? #273

Closed Vlad-Dembrovskyi closed 2 years ago

Vlad-Dembrovskyi commented 2 years ago

If the space savings are worth it (the files shrink a lot), go for it.

Add a step to compress the file at the end of the STAR process.

If we start from the second part of the pipeline, we start from StringTie, so add conditional decompression (remember that we also need the .bai index).
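The conditional step could look roughly like the sketch below. It assumes the compressed form is CRAM (one of the options raised later in this thread) and that `samtools` is available. The function name `restore_commands` and the `reference_fasta` argument are hypothetical and not part of the pipeline. It only builds the commands, so the caller decides how to run them.

```python
from pathlib import Path


def restore_commands(alignment, reference_fasta):
    """Return the samtools commands needed to end up with a BAM and its .bai.

    Hypothetical helper: assumes the compressed form is CRAM.
    """
    path = Path(alignment)
    commands = []
    if path.suffix == ".cram":
        # CRAM input: decode back to BAM first (-b = BAM output, -T = reference).
        bam = path.with_suffix(".bam")
        commands.append(
            ["samtools", "view", "-b", "-T", reference_fasta,
             "-o", str(bam), str(path)]
        )
    elif path.suffix == ".bam":
        # BAM input: nothing to decode.
        bam = path
    else:
        raise ValueError(f"Unsupported alignment file: {alignment}")

    # samtools index writes <file>.bam.bai by default; skip it if already there.
    if commands or not Path(str(bam) + ".bai").exists():
        commands.append(["samtools", "index", str(bam)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. A BAM that already has its index next to it yields an empty list.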

Vlad-Dembrovskyi commented 2 years ago

Brittany to research this on Sumner.

angarb commented 2 years ago

@Vlad-Dembrovskyi (example BAM files)

16G LIB11_Luminal/LIB11_Luminal.Aligned.sortedByCoord.out.bam
27G LIB1_Luminal/LIB1_1_Luminal.Aligned.sortedByCoord.out.bam
28G LIB5_Luminal/LIB5_2_Luminal.Aligned.sortedByCoord.out.bam
28G LIB7_Luminal/LIB7_2_Luminal.Aligned.sortedByCoord.out.bam
25G LIB9_Luminal/LIB9_3_Luminal.Aligned.sortedByCoord.out.bam
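To put these sizes in perspective, here is a small back-of-the-envelope sketch. The sizes are copied from the listing above. The reduction fractions are assumptions taken from the 10-20% range mentioned later in this thread, not measured results.

```python
# Sizes in GB, copied from the example BAM listing above.
bam_sizes_gb = {
    "LIB11_Luminal": 16,
    "LIB1_1_Luminal": 27,
    "LIB5_2_Luminal": 28,
    "LIB7_2_Luminal": 28,
    "LIB9_3_Luminal": 25,
}


def estimated_saving_gb(sizes, reduction):
    """Space saved (GB) if every file shrinks by the given fraction."""
    return sum(sizes.values()) * reduction


total_gb = sum(bam_sizes_gb.values())

# Assumed reduction fractions; real numbers have to come from testing.
for reduction in (0.10, 0.20):
    saved = estimated_saving_gb(bam_sizes_gb, reduction)
    print(f"{reduction:.0%} reduction: saves {saved:.1f} of {total_gb} GB")
```

For these five samples the total is 124 GB, so a 10-20% reduction would free roughly 12 to 25 GB.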

Vlad-Dembrovskyi commented 2 years ago

Task 1: research BAM compression options. Since BAM is already a compressed format, there is not much room for further compression; see https://www.biostars.org/p/420404/ for example. Options: convert to CRAM, or use a more efficient compressor than gzip. The main limiting factor is CPU time: if it takes too long to reduce the size by 10-20%, it may not be worth it. Test the different options on big BAM files. Also check out https://academic.oup.com/bioinformatics/article/37/16/2225/6135077
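One way to run such a test is sketched below for the CRAM option only. It assumes `samtools` is installed and that the reference FASTA used for alignment is at hand. The helper names `cram_command` and `benchmark`, and the default thread count, are illustrative choices rather than anything from the pipeline.

```python
import os
import subprocess
import time


def cram_command(bam, reference_fasta, cram, threads=4):
    """Build a samtools command that converts a BAM file to CRAM.

    -C selects CRAM output, -T gives the reference FASTA,
    -@ sets the number of additional threads, -o names the output file.
    """
    return ["samtools", "view", "-C", "-T", reference_fasta,
            "-@", str(threads), "-o", cram, bam]


def benchmark(command, source, target):
    """Run a command and return (elapsed seconds, target/source size ratio)."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    ratio = os.path.getsize(target) / os.path.getsize(source)
    return elapsed, ratio
```

A call such as `benchmark(cram_command("in.bam", "genome.fa", "out.cram"), "in.bam", "out.cram")` returns the wall-clock time and the size ratio. A ratio of 0.8 to 0.9 corresponds to the 10-20% reduction discussed above. Note that this measures wall-clock time, not CPU time, so the thread count matters when comparing runs.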

Task 2: if testing shows meaningful compression, implement the compression at the end of the processes that produce BAMs, so that only compressed BAMs are saved in the results folder. This has to be controlled by an optional parameter that is true by default.

Vlad-Dembrovskyi commented 2 years ago

https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/issues/285