galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License
164 stars 439 forks source link

samtools_bam_to_cram does not store genome deltas #6523

Open alpapan opened 3 weeks ago

alpapan commented 3 weeks ago

Problem described here https://github.com/GMOD/jbrowse-components/issues/4637

Basically, i'm not sure why it is using the --output-fmt-option no_ref since it requires a reference just above that line.

            -T "\$reffa"
            -t "\$reffai"

by including --output-fmt-option no_ref, we lose the differences to the genome which makes jbrowse2 visualisation not work correctly. was there a benefit using this option?

bernt-matthias commented 3 weeks ago

The problem is that the cram format would not store the reference, but only the location of the reference. In distributed compute settings (pulsar) the location of the reference will be different during the job. Hence it won't work in all settings.

Maybe the new embed_ref=0|1 output option could be of help?

cmdcolin commented 3 weeks ago

The problem is that the cram format would not store the reference, but only the location of the reference

This problem is not unique to "hpc distributed compute". It is applies to almost anyone who uses CRAM for basically any purpose. People who use CRAM should be (I would claim: they are) accustomed to explicitly providing the reference genome via a -T argument (e.g. in the case of using samtools) anytime they use it, because the automatic lookup mechanisms are quite limited. Samtools is pretty good at throwing an error when users do not specify the reference properly or the reference is not found or the wrong reference is provided (the error may be confusing, but it occurs)

The no_ref flag "materially changes how CRAM data is stored" to not include any information about SNPs. it basically removes the concept of using "reference based compression" which is what CRAM is designed for. It could have complex downstream effects on variant analysis that users do. You can see in the jbrowse thread: there are no SNPs! That's because with no_ref, no SNPs are encoded in the file!

Any tool that is faced with a no_ref file would have to do "heroics" (read: things it shouldn't have to do) to reconstruct the SNPs by comparing the CRAM file against the reference genome (again, the reference genome has to get looped back at some point!) so why not just encode the delta between the reference genome and the sequence data in the CRAM file itself?

cmdcolin commented 3 weeks ago

(I am also not an advocate for the embedded reference concept, though it is debateable)

bernt-matthias commented 2 weeks ago

So, what would be a solution to this problem?

cmdcolin commented 2 weeks ago

turn off using no_ref=1 by default. there is no problem that is being solved by it

cmdcolin commented 2 weeks ago

if it is not clear what i mean by "there is no problem being solved by it", i mean specifically that CRAM always requires a reference file to be retrieved at some point. CRAM decoding tools almost always allow user specified reference genomes to be added. so instead of saying "we can't know where the reference is, so we will add no_ref" the answer is "we will use the default, we don't need to add no_ref, because if the user receives any error, then they will know to add a -T argument to samtools to specify their reference genome path"

bernt-matthias commented 2 weeks ago

FYI https://github.com/galaxyproject/tools-iuc/pull/1963#discussion_r207997309

Also it seems that we use no_ref in many tools.

I try to draw attention to this issue. If we fix it, we will need to fix it it quite a few places.

alpapan commented 2 weeks ago

FYI https://github.com/galaxyproject/tools-iuc/pull/1963#discussion_r207997309

Also it seems that we use no_ref in many tools.

I try to draw attention to this issue. If we fix it, we will need to fix it it quite a few places.

Fingers crossed this won't lead to any GWAS paper retractions :-)

cmdcolin commented 2 weeks ago

I did a little more analysis of data files that are created using the no_ref option. I tried for example, creating a "samtools mpileup" on a file with and without no_ref added. To my surprise, the mpileup's turn out they are the same. If I delete the original reference file, running mpileup on the cram file without no_ref causes an error, but running it on the cram file with no_ref still works, and, the mpileup is still bit-identical with when the reference file was still on disk.

This was interesting and surprising to me so I had to figure out how the no_ref was resolving the genome delta's (e.g. the SNPs...). Turns out samtools uses a trick thatJBrowse was not aware : the trick is, I believe, that no_ref files are actually given an MD tag. This was surprising because MD is normally used in BAM and I was not aware of it's usage in CRAM, but with the MD tag in a CRAM file with no_ref, you CAN actually calculate the genome deltas.

JBrowse, in this case, can add a feature (or bugfix depending on how you view it) to use the MD tag in CRAM files with no_ref to solve the bug that @alpapan initially reported to us.

I still consider it somewhat odd to use the no_ref behavior but I can walk back my hard-line stance against it. It looks like Galaxy has been using it since 2018

I don't have too many external links on this as information is sparse on the web about no_ref but i hope this helps

cmdcolin commented 2 weeks ago

I would be curious if you have any further thoughts on this @jkbonfield or any other context for no_ref behavior, but with the MD tag, jbrowse should be able to get the SNPs from no_ref cram files

jkbonfield commented 2 weeks ago

CRAM can compute MD on the fly if it's given a reference as the SEQ column is stored as the delta between the reference and the SAM sequence string. That delta is basically the same thing as MD, so any two of the three can infer the other. Although where CRAM made a mistake was not having a mechanism to indicate whether MD (and NM) tags were present or not, hence you always get them irrespective (or never if you ask not to generate them).

Obviously if we're switching to no-ref mode, then the CRAM sequence is stored verbatim and not as a delta against the (non-existant) reference. That also means we cannot preserve the MD tag and we'd have data loss. CRAM instead stores MD verbatim, just as it would any other string aux tag. I wouldn't expect it to add MD tags where they didn't previously exist however. [Note HTSlib also stores MD if it would be different to the auto-decoded one, as people complained about round-trip failures. It turns out people prefer it to round trip erroneous data, bugs and all, than having the equivalent of a builtin "samtools calmd" correction step. Sad, but understandable.]

I don't understand the use of no_ref, although it was perhaps just because it's simple to implement and cuts out large swathes of the CRAM spec. I can understand the desire to avoid external references as they can be a pain in the derrier at times. However it would be far better to use an embedded reference instead. Newer htslib even defaults to something similar to this if no external reference can be identified. It can compute a reference if the data is chr/pos aligned by either looking at SEQ + MD to derive the ref seq used or failing that by performing a trivial consensus calculation from the overlapping sequence data. This permits us to store the reference for that CRAM slice and then delta sequence against it. Nominally it's no different than just storing sequence as-is and letting the LZ (it's using gzip) algorithm deduplicate things for us, but practically speaking gzip does a terrible job and it always comes out considerably larger than doing the delta ourselves.

cmdcolin commented 1 week ago

thanks so much for providing this perspective @jkbonfield, I may have overstated the importance of the MD tag in my above assessment, I'll have to do maybe even a further analysis

"However it would be far better to use an embedded reference instead. Newer htslib even defaults to something similar to this if no external reference can be identified. "

do you have any examples of CRAM files that use these techniques?

jkbonfield commented 1 week ago

Not online, because ENA can't accept them yet as far as I'm aware :-(

However you can produce them easily in samtools.

# external reference based
samtools view -O cram -T x.ref in.bam -o out.cram

# embedded reference
samtools view -O cram,embed_ref -T x.ref in.bam -o out.cram

# embedded consensus
samtools view -O cram,embed_ref=2 in.bam -o out.cram

# no reference
samtools view -O cram,no_ref in.bam -o out.cram

You can verify what it's doing with samtools cram-size out.cram too and look for the various blocks being used and their relative sizes. Also this shows how much space is taken up in MD and NM tags. Also you can try use_bzip2 or use_lzma (slow) for no_ref mode, which may dramatically reduce storage. Gzip (deflate) just isn't that good a tool for sequence compression, and it fails dismally once we get to long-read sequences as the length of an individual sequence may be longer than the maximum gzip window.

The purpose of embedded consensus was basically to make CRAM viable for dealing with assemblies where we don't want to distribute a reference (it's just something else to go wrong). Furthermore if we're doing something that turns CRAM into another CRAM, such as filtering, then it can automatically detect the lack of a known reference and recreate the new CRAM from scratch again. (It's possible to extract the embedded reference and supply it as a reference file, but it's problematic and likely leads to more pitfalls about not supplying both CRAM & ref.)

Using an external reference doesn't save that much either. It's basically 2-bits per covered base regardless of alignment depth. For very shallow data that's ~700Mb off a human genome, which may be a significant portion. For deep data it's still ~700Mb less, but that's perhaps only 5-10% of the storage costs. Most of the gain over BAM came from collating data into columns and smarter compression codecs.

cmdcolin commented 1 week ago

thanks, that should be a good start!

alpapan commented 2 days ago

it seems that this was due to an old 2018 decision https://github.com/galaxyproject/tools-iuc/pull/1963 (according to a comment i found in iuc/samtools_view/32dc5f781059/samtools_view/samtools_view.xm)