bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

review bam to cram compression #3107

Closed by matthdsm 4 years ago

matthdsm commented 4 years ago

Hi,

Would you guys be open to reviewing the BAM to CRAM conversion in bcbio? Currently, this is done using either samtools or bam-squeeze. I propose replacing both with scramble, which should be faster and uses state-of-the-art compression. The tool can be installed using conda and adds only a minimal dependency.

Let me know what you think and if I should put some work towards this.

Cheers M

matthdsm commented 4 years ago

for reference, scramble options

-bash-4.2$ scramble -h
  -=- sCRAMble -=-     version 1.14.11
Author: James Bonfield, Wellcome Trust Sanger Institute. 2013-2018

Usage:    scramble [options] [input_file [output_file]]
Options:
    -I format      Set input format:  "bam", "sam" or "cram".
    -O format      Set output format: "bam", "sam" or "cram".
    -1 to -9       Set compression level.
    -0 or -u       No compression.
    -H             [SAM] Do not print header
    -R range       [Cram] Specifies the refseq:start-end range
    -r ref.fa      [Cram] Specifies the reference file.
    -b integer     [Cram] Max. bases per slice, default 5000000.
    -s integer     [Cram] Sequences per slice, default 10000.
    -S integer     [Cram] Slices per container, default 1.
    -V version     [Cram] Specify the file format version to write (eg 1.1, 2.0)
    -e             [Cram] Embed reference sequence.
    -x             [Cram] Non-reference based encoding.
    -M             [Cram] Use multiple references per slice.
    -m             [Cram] Generate MD and NM tags.
    -Z             [Cram] Also compress using lzma.
    -f             [Cram] Also compression using fqzcomp (V3.1+)
    -n             [Cram] Discard read names where possible.
    -P             Preserve all aux tags (incl RG,NM,MD)
    -p             Preserve aux tag sizes ('i', 's', 'c')
    -q             Don't add scramble @PG header line
    -N integer     Stop decoding after 'integer' sequences
    -t N           Use N threads (availability varies by format)
    -B             Enable Illumina 8 quality-binning system (lossy)
    -!             Disable all checking of checksums
    -g FILE        Convert to Bam using index (file.gzi)
    -G FILE        Output Bam index when bam input(file.gzi)
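
As a rough illustration of how the options above might be combined for a BAM to CRAM conversion, here is a minimal Python sketch. It is not the bcbio implementation: the function name `build_scramble_cmd` and the file names are hypothetical, and only flags listed in the help text above (`-I`, `-O`, `-r`, `-t`, `-1` to `-9`, `-B`) are used.

```python
import shlex
from typing import List, Optional


def build_scramble_cmd(
    in_bam: str,
    out_cram: str,
    ref_fa: str,
    threads: int = 1,
    level: Optional[int] = None,
    bin_quals: bool = False,
) -> List[str]:
    """Build an argv list for a scramble BAM -> CRAM conversion.

    Hypothetical helper for illustration; flags are taken from the
    `scramble -h` output quoted in this thread.
    """
    cmd = [
        "scramble",
        "-I", "bam",         # input format
        "-O", "cram",        # output format
        "-r", ref_fa,        # reference FASTA used for CRAM encoding
        "-t", str(threads),  # number of threads
    ]
    if level is not None:
        if not 1 <= level <= 9:
            raise ValueError("compression level must be between 1 and 9")
        cmd.append(f"-{level}")  # -1 to -9 set the compression level
    if bin_quals:
        cmd.append("-B")  # Illumina 8 quality binning (lossy)
    # Positional arguments: input file, then output file.
    cmd += [in_bam, out_cram]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; print the command instead of running it.
    print(shlex.join(build_scramble_cmd("in.bam", "out.cram", "ref.fa", threads=4)))
```

Printing the command rather than executing it keeps the sketch runnable without scramble installed; to run the conversion, the returned list could be passed to `subprocess.run(cmd, check=True)`.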

ohofmann commented 4 years ago

+1 on scramble.

roryk commented 4 years ago

Thanks so much, and sorry for not responding sooner. Yes, we're definitely up for it. I'll test your PR locally.

matthdsm commented 4 years ago

No problem, let me know if you need anything else. Local tests check out nicely.

Cheers M

naumenko-sa commented 4 years ago

Thanks @roryk and @matthdsm! I see it has been released in bcbio 1.2.2! https://github.com/bcbio/bcbio-nextgen/commit/037fa09812556698233e477fe9e96cecfee21d37 Closing to celebrate this achievement!