bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Support for lossy CRAM in bcbio WGrS variant calling pipeline #2531

Closed WimSpee closed 4 years ago

WimSpee commented 5 years ago

Hi @chapmanb .

There was a nice GA4GH BAM/VCF compression workshop yesterday. See here for the slides. https://drive.google.com/drive/folders/1bQcgqPn_BAYNGlyCVrX_IRLDvYbQjf8t

I am wondering what your thoughts are on lossy CRAM compression and if the bcbio WGrS pipeline already supports it.

There are some good lossy CRAM compression points specified here by a group of the Broad (Broad is not yet using all of these points in production). https://drive.google.com/file/d/1aeYb8WscyAozQQz4hh4EV9wHQj-1UFep/view?usp=sharing

Using the above lossy compression methods would result in CRAM files roughly one tenth (i.e. about 10%) the size of the standard BAM files.

It would be nice if the bcbio WGrS pipeline optionally supported some or most of the above lossy CRAM methods to save on data storage and transfer costs.

This is the first time that I am exploring switching to CRAM files. Do you know if it's well supported by bioinformatics software?

Thank you for your thoughts on this.

chapmanb commented 5 years ago

Wim; Thanks much for starting this discussion. I also saw those slides and am excited by the somatic validation benchmarks on 2-bin CRAMed files; that would really help with file sizes. Right now bcbio supports CRAM output with 8-bins and read removal. We already don't include original qualities. I haven't worked on removing duplicates, since I generally like to keep the final output CRAM/BAM convertible back to the original fastqs, so you can safely store only the aligned file and get rid of the original fastqs. This is specified with archive: cram:

https://bcbio-nextgen.readthedocs.io/en/latest/contents/configuration.html#post-processing
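As a sketch, the archive option sits under a sample's algorithm section in the bcbio YAML. Only archive: cram is the knob being discussed here; the surrounding field values are illustrative, so check the configuration docs above for the authoritative syntax:

```yaml
# Minimal bcbio sample configuration sketch.
# Only "archive: cram" is the option discussed here; other values are illustrative.
details:
  - description: sample1
    analysis: variant2
    genome_build: GRCh37
    algorithm:
      archive: cram   # compress final alignments to lossy CRAM during post-processing
```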

and the implementation uses samtools and bam squeeze:

https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/bam/cram.py

There are a number of things we'd need to explore to have more CRAM support in bcbio.

I'd be happy to hear other people's experiences/suggestions. Thanks again for starting this discussion.

ohofmann commented 5 years ago

Can I add 'impact of compression on (somatic) SV and CNV calls' to the shopping list :) ?

WimSpee commented 5 years ago

We don't do much somatic / pooled sequencing; for those purposes we can probably stick to plain BAM files for the longer-term future.

For germline WGrS we use the BAM files downstream of bcbio mostly in genome browsers. We would thus need to check that all the genome browsers we use support CRAM.

Germline small variant re-analysis on the same reference (triggered by sequencing more samples) is from gVCF. Germline small variant re-analysis on a different reference (triggered by an improved reference) is from FASTQ (we currently still store these separately). We currently keep these on colder storage, so we don't mind having them in addition to the aligned BAM/CRAM files.

Germline SV re-analysis on the same reference (triggered by sequencing more samples) indeed is from BAM, so this is also an important point for us. Hopefully the (to-be-released) GATK4 CNV and SV tools support CRAM and perform as well as other CNV/SV tools that use BAM?

We luckily don't have a storage bill of 9M (as the Broad does), so we can wait for good tool and analysis support around CRAM, including germline CNV/SV calling. Thank you (Brad and Oliver) for the information and your thoughts on this.

chapmanb commented 5 years ago

@yfarjoun -- we've been having a helpful discussion about improving CRAM support triggered by your great presentation at the GA4GH meeting. I'm interested in modernizing CRAM bin and read name lossy compression in bcbio and then using these for running through some of our validations. Do you have tool recommendations for creating 4-bin and 2-bin lossy CRAMs? We're using BamUtil squeeze now (https://genome.sph.umich.edu/wiki/BamUtil:_squeeze) but would be keen to move to the latest and greatest. Thanks so much.

helgridly commented 5 years ago

Hello! I'm Hussein, the person who gave the presentation at GA4GH. To answer a couple of the points raised here:

First: samtools 1.8 will let you write a CRAM with lossy read names:

samtools view -T ref.fasta -C --output-fmt-option lossy_names=1 -o output.cram input.bam

I believe this gives your read names an incrementing sequence number, though @jkbonfield will be able to confirm.

Second, on 2-bin QS: I wrote a crappy Picard tool that's hanging out in a branch somewhere, but nothing I'd consider production ready. The algorithm we used was "anything Q20+ becomes Q30, anything else is Q2". For our somatic validation I tested that, the deoscillation strategy I mentioned in the talk, and Crumble with the default values published in the Crumble paper (-9p8 -u30 -Q60 -D100). James has mentioned that Crumble has plenty of levers one might tinker with to improve performance in the somatic case (namely crumble -L 0 and possibly crumble -L 0 -X 1 -u 30), but I can't speak to them and haven't tested them.
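As an illustration, the 2-bin rule quoted above ("anything Q20+ becomes Q30, anything else is Q2") can be sketched over a Phred+33 quality string in a few lines of Python. This is a toy sketch of the algorithm, not the Picard tool itself:

```python
def two_bin(quals: str, threshold: int = 20, high: int = 30, low: int = 2) -> str:
    """Apply 2-bin quality compression to a Phred+33-encoded quality string."""
    out = []
    for ch in quals:
        q = ord(ch) - 33                      # decode Phred+33 to a quality value
        q = high if q >= threshold else low   # Q20+ -> Q30, everything else -> Q2
        out.append(chr(q + 33))               # re-encode as Phred+33
    return "".join(out)

# 'I' = Q40, '5' = Q20, '#' = Q2, '!' = Q0
print(two_bin("I5#!"))  # -> "??##"  ('?' = Q30, '#' = Q2)
```

With only two distinct quality symbols left per read, the quality track compresses dramatically, which is where most of the CRAM size win comes from.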

I'm hoping to do a comparative analysis of somatic performance of various QS futzers, since my research has been fairly limited so far and hasn't touched on other approaches such as CALQ, QVZ2, and Quartz. But I can't promise how soon I'll get to it.

jkbonfield commented 5 years ago

Thanks for the comments and I'm glad you enjoyed the GA4GH session.

I found 2-bin quality encoding to be harmful (see my slides), but less so than having no qualities. It likely depends a lot on the quality threshold used and what the replacement values are so this may explain the disparity between our assessments, as well as there being a myriad of ways of evaluating the impact. This will need more experimentation.

Regarding lossy names, yes, this (-O cram,opt or --output-fmt-option opt) is the easiest way to do it. When reading, it'll generate new names on-the-fly based on the filename / index in the file. You can adjust the prefix using the name_prefix option. E.g.:

samtools view -O cram,lossy_names -o out.cram in.bam
samtools view --input-fmt-option name_prefix=foo out.cram

One problem is that none of this is remembered. For example, if we view -C out.cram above in CRAM format, it won't recode it with lossy_names again; instead it'll generate names during the decode and then encode them as actual name strings (as it would when emitting to SAM / BAM). This is frankly just some brain-dead tooling. We can work around it by remembering to add the lossy_names option again. (The same situation applies to embedded references.)
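The regeneration idea can be sketched like this. The principle (a configurable prefix plus an incrementing record index, so mates of a pair recover matching names) is from the description above; the exact name format samtools emits is an assumption here:

```python
def regenerate_names(prefix: str, n_records: int) -> list[str]:
    # lossy_names discards the original read names at encode time; on decode,
    # replacement names are synthesized from a prefix (cf. the name_prefix
    # option) plus the record's position in the file. Exact formatting is
    # illustrative, not samtools' literal output.
    return [f"{prefix}.{i}" for i in range(n_records)]

print(regenerate_names("foo", 3))  # -> ['foo.0', 'foo.1', 'foo.2']
```

Because the names are a pure function of file position, they are reproducible on every decode, but any information encoded in the original names (flowcell, tile, coordinates) is permanently gone.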

ohofmann commented 5 years ago

James,

did you find the 2-bin (pass/no-pass) scheme to be harmful for NovaSeq data, too? From what I am seeing, quality values of 2 and 12 indicate issues, with 23 and 37 being fine.

jkbonfield commented 5 years ago

No - I only tried Syndip because that's what I had a truth set for. If we could get someone to sequence those cell lines with NovaSeq it'd be great, but from what I understand the cell lines aren't public.

Certainly for NovaSeq the choice of threshold and replacement values is far easier!

matthdsm commented 4 years ago

Perhaps of interest on this topic, @jkbonfield 's presentation on CRAM 3.1 and Crumble from the latest BOSC: https://www.slideshare.net/JamesBonfield/cram-31-crumble

M