TheJacksonLaboratory / splicing-pipelines-nf

Repository for the Anczukow-Lab splicing pipeline

(Low prio) Compress BAMs to CRAMs for storage #285

Closed Vlad-Dembrovskyi closed 2 years ago

Vlad-Dembrovskyi commented 2 years ago

BAMs take up a lot of space. CRAM files are much smaller, so we could convert BAMs to CRAMs for long-term storage. The one caveat is that the FASTA reference used to create the BAMs and CRAMs must be stored alongside the CRAMs; otherwise the information in them will not be recoverable.

We have researched how well CRAM compression works: link. In a nutshell, the average compression is 60%, and compressing a 100 GB BAM file takes about 1 h on 11 CPUs.

Implementation suggestion:

imendes93 commented 2 years ago

This benchmark was performed using the bam2cram workflow on the following dataset:

| BAM | Fraction | BAM Size (B) | BAM Size (GB) | S3 |
|---|---|---|---|---|
| DRR260185.bam | 1 | 98517385363 | 91.8 | s3://eu-west-1-example-data/nihr/bam/DRR260185.bam |
| SS75_DRR260185.bam | 0.75 | 73888039022 | 54.4 | s3://eu-west-1-example-data/nihr/bam/SS75_DRR260185.bam |
| SS50_DRR260185.bam | 0.50 | 49258692682 | 37.2 | s3://eu-west-1-example-data/nihr/bam/SS50_DRR260185.bam |
| SS25_DRR260185.bam | 0.25 | 24629346341 | 19.8 | s3://eu-west-1-example-data/nihr/bam/SS25_DRR260185.bam |
| SS10_DRR260185.bam | 0.10 | 9851738536 | 9.0 | s3://eu-west-1-example-data/nihr/bam/SS10_DRR260185.bam |

| Profile | CRAM version | Options |
|---|---|---|
| default | 3.0 | |
| fast | 3.0 | seqs_per_slice=1000, level=1 |
| normal | 3.0 | seqs_per_slice=10000 |
| small | 3.0 | seqs_per_slice=25000, level=6, use_bzip2 |
| archive | 3.0 | seqs_per_slice=100000, level=7, use_bzip2 |
| archive lzma | 3.0 | seqs_per_slice=100000, level=7, use_bzip2, use_lzma |
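For readers who want to try one of these profiles by hand, here is a minimal sketch of how the options could be passed to `samtools view`. It is not the bam2cram workflow itself: the function name `bam2cram_command`, the file names, the `archive_lzma` key, and the default of 11 threads are assumptions made for illustration. The format options (`version`, `seqs_per_slice`, `level`, `use_bzip2`, `use_lzma`) are htslib CRAM output options, with the boolean ones written out as `=1`.

```python
import shlex

# Options copied from the profile table; "default" adds nothing beyond the version.
PROFILES = {
    "default": [],
    "fast": ["seqs_per_slice=1000", "level=1"],
    "normal": ["seqs_per_slice=10000"],
    "small": ["seqs_per_slice=25000", "level=6", "use_bzip2=1"],
    "archive": ["seqs_per_slice=100000", "level=7", "use_bzip2=1"],
    "archive_lzma": ["seqs_per_slice=100000", "level=7", "use_bzip2=1", "use_lzma=1"],
}


def bam2cram_command(bam, cram, fasta, profile="archive", threads=11):
    """Return a samtools argument list converting `bam` to a CRAM 3.0 file.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    # samtools accepts FORMAT[,OPT[=VAL]]... after --output-fmt
    fmt = ",".join(["cram", "version=3.0", *PROFILES[profile]])
    return [
        "samtools", "view",
        "-@", str(threads),   # additional compression threads
        "-T", fasta,          # reference FASTA, which must be kept with the CRAM
        "--output-fmt", fmt,
        "-o", cram,
        bam,
    ]


# Print the command line for the "archive" profile with placeholder file names.
print(shlex.join(bam2cram_command("sample.bam", "sample.cram", "genome.fa")))
```

The returned list can be handed to `subprocess.run(..., check=True)` once the paths point at real files.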

As of samtools version 1.14, CRAM 3.1 is available, but it has been excluded from the benchmark results: it is not yet a ratified GA4GH standard and, being still in development, is not advised for long-term storage.

Of the profiles tested, samtools_archive_lzma_30 showed the highest average compression rate, but it took a significant amount of time (on average 49 minutes and 22 seconds). The next highest compression rate was obtained by the samtools_archive_30 profile, which took almost 14 minutes less on average (35 minutes and 36 seconds).
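The time trade-off between the two archive profiles can be checked with a few lines of arithmetic. This sketch uses only the two average runtimes quoted above; the variable names are illustrative.

```python
from datetime import timedelta

# Average runtimes reported for the two archive profiles.
archive_lzma = timedelta(minutes=49, seconds=22)
archive = timedelta(minutes=35, seconds=36)

# Time saved by choosing the archive profile over archive lzma.
saved = archive_lzma - archive
print(saved)                          # 0:13:46
print(f"{saved / archive_lzma:.0%}")  # 28%
```

In other words, dropping `use_lzma` cuts the average runtime by roughly a quarter, which has to be weighed against the extra compression that lzma provides.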

Vlad-Dembrovskyi commented 2 years ago

Considered not worth it for now.