This benchmark was performed using the bam2cram workflow on the following dataset:
| BAM | Fraction | BAM Size (bytes) | BAM Size (GB) | S3 |
|---|---|---|---|---|
| DRR260185.bam | 1.00 | 98517385363 | 91.8 | s3://eu-west-1-example-data/nihr/bam/DRR260185.bam |
| SS75_DRR260185.bam | 0.75 | 73888039022 | 54.4 | s3://eu-west-1-example-data/nihr/bam/SS75_DRR260185.bam |
| SS50_DRR260185.bam | 0.50 | 49258692682 | 37.2 | s3://eu-west-1-example-data/nihr/bam/SS50_DRR260185.bam |
| SS25_DRR260185.bam | 0.25 | 24629346341 | 19.8 | s3://eu-west-1-example-data/nihr/bam/SS25_DRR260185.bam |
| SS10_DRR260185.bam | 0.10 | 9851738536 | 9.0 | s3://eu-west-1-example-data/nihr/bam/SS10_DRR260185.bam |
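The subsampled inputs were presumably produced with samtools' read subsampling; a minimal sketch, assuming `samtools view -s` was used (the seed is arbitrary, not from the original run):

```bash
# Subsample ~75% of reads from the full BAM.
# With -s, the integer part is the random seed and the
# fractional part is the fraction of reads to keep.
samtools view -@ 11 -b -s 42.75 -o SS75_DRR260185.bam DRR260185.bam
```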
| Profile | CRAM version | Options |
|---|---|---|
| default | 3.0 | |
| fast | 3.0 | seqs_per_slice=1000,level=1 |
| normal | 3.0 | seqs_per_slice=10000 |
| small | 3.0 | seqs_per_slice=25000,level=6,use_bzip2 |
| archive | 3.0 | seqs_per_slice=100000,level=7,use_bzip2 |
| archive lzma | 3.0 | seqs_per_slice=100000,level=7,use_bzip2,use_lzma |
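The option strings above are standard htslib CRAM output-format options, so each profile maps onto a plain samtools command. A sketch of the `archive lzma` profile (the exact bam2cram invocation and the reference path are assumptions):

```bash
# 'archive lzma' profile expressed as a direct samtools call;
# reference.fa is illustrative, -@ 11 matches the 11 CPUs used here.
samtools view -@ 11 -T reference.fa \
    -O cram,version=3.0,seqs_per_slice=100000,level=7,use_bzip2,use_lzma \
    -o DRR260185.cram DRR260185.bam
```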
As of samtools version 1.14, CRAM 3.1 is available, but it was excluded from this benchmark because it is not yet a ratified GA4GH standard and, being still in development, is not recommended for long-term storage.
Of the profiles tested, `samtools_archive_lzma_30` showed the highest average compression rate, but it took a significant amount of time (49 minutes and 22 seconds on average). The next highest compression rate was obtained by the `samtools_archive_30` profile, which was on average almost 14 minutes faster (35 minutes and 36 seconds).
Considered not worth it for now.
BAMs take a lot of space; CRAM files are much smaller, so we could convert to CRAM for long-term storage. The one caveat with this solution is that the reference FASTA used to create the BAMs and CRAMs must be stored alongside the CRAMs, otherwise the information in them will not be recoverable.
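To illustrate the caveat, decoding a CRAM needs the exact reference it was encoded against; a minimal round-trip sketch (file names are illustrative):

```bash
# Encode: BAM -> CRAM against the reference.
samtools view -@ 11 -C -T reference.fa -o sample.cram sample.bam

# Decode: CRAM -> BAM requires the SAME reference FASTA;
# without it the read sequences cannot be reconstructed.
samtools view -@ 11 -b -T reference.fa -o restored.bam sample.cram
```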
We have researched how well CRAM compression works: link. In a nutshell, the average compression is 60%, and the average time to compress a 100 GB BAM file is ~1 hour on 11 CPUs.
Implementation suggestion: a `--bams` parameter instead of `--reads` (a hypothetical invocation is sketched below).
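A hypothetical invocation with the proposed parameter, assuming a Nextflow-style CLI (the entry point and parameter names are illustrative, not the current interface):

```bash
# Hypothetical: run the workflow directly on BAM inputs with the proposed --bams flag;
# --fasta is an assumed parameter for the reference needed to decode the CRAMs later.
nextflow run bam2cram \
    --bams 's3://eu-west-1-example-data/nihr/bam/*.bam' \
    --fasta reference.fa
```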