CRAM not lossless - Githubissues

GMOD / jbrowse-components

Source code for JBrowse 2, a modern React-based genome browser

https://jbrowse.org/jb2

Apache License 2.0

211 stars 63 forks source link

CRAM not lossless #4637

Closed alpapan closed 3 weeks ago

alpapan commented 3 weeks ago

version jb2 2.15.4

I brought this up with Ian when i saw him as i don't know if it is a CRAM format issue or a jb2 one: CRAM is meant to be lossless but as the screenshot shows, this doesn't seem to be the case:

i can provide access to files confidentially via email/chat if needed.

cmdcolin commented 3 weeks ago

If you can provide the files it would be great can email to colin.diesh@gmail.com (ref fasta and cram and maybe bam too)

cmdcolin commented 3 weeks ago

Checked out the email....It looks like your CRAM file is using a weird option called "--output-fmt-option no_ref" where specifying it like that probably implies no_ref=1

This is an odd option and is the cause of this issue. The CRAM docs say about this flag http://www.htslib.org/doc/samtools.html

no_ref=0|1

    CRAM output only; defaults to 0 (off). If 1, sequences will be stored verbatim with no reference encoding. This can be useful if no reference is available for the file.

this means that the reads don't store any of the deltas with the reference genome....that's why the snp's are all lost. if you are able, I would just turn off this option and then everything should work :)

cmdcolin commented 3 weeks ago

note: an ambitious effort could render no_ref cram files properly in jbrowse 2...it would be similar to how snps are calculated on the fly in bam for bam files with no MD tag.

the best case is to just not use no_ref though, as on the fly calculations are slow

alpapan commented 3 weeks ago

yeap that was it. I didn’t know -no_ref was included in the creation (and i never checked)!

this must have been done by Galaxy’s tool XML So I went and recreated a CRAM from the bam myself and you’re right, problem gone.

it’s hard-coded in the galaxy tool xml: 
            -o '${output_alignment}'
            -T "\$reffa"
            -t "\$reffai"
            --output-fmt-option no_ref
            infile

(why on earth are they doing it since the parameter list clearly includes the reference!) going to find the galaxy issue tracker and link to here.

alpapan commented 3 weeks ago

note: an ambitious effort could render no_ref cram files properly in jbrowse 2...it would be similar to how snps are calculated on the fly in bam for bam files with no MD tag.

the best case is to just not use no_ref though, as on the fly calculations are slow

question: when using BAMs, is the jb2 best practice to run calmd and to get the md tag? like it will be more efficient? (i currently do this but i never documented why i did)

cmdcolin commented 2 weeks ago

yes, it is best for there to be MD tags in BAM files. MD says where the SNPs are. without it, jbrowse will manually look at the reference sequence for each read and find the SNPs since CIGAR doesn't tell you that. fun historical note: JBrowse 1 did not have this routine for a long time so it required MD (from the calmd command), but now we can display the SNPs without MD, its just a bit slower

note that CRAM does not use CIGAR or MD really, it uses CRAM's own encoded data structures that replace CIGAR and MD. CIGAR is pretty fundamental, not a tag, but it gets converted into something entirely different in CRAM internal encoding, and that internal encoding is normally combined with the MD type data, in what we call "read features". so that's why you wouldn't really see MD in CRAM. the no_ref thing from this thread is quite similar in effect to not having MD though.