bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
985 stars 353 forks source link

hg38 validation - proper use of BWA alt index - missed variants in 641 genes? #3069

Open naumenko-sa opened 4 years ago

naumenko-sa commented 4 years ago

Hello everyone, especially hg38 reference genome users!

Now bcbio has hg38 support for almost 5 years! http://bcb.io/2015/09/17/hg38-validation/

However, many user were still on GRCh37/hg19, because Gnomad had native calls for this reference only. Last year, they have released calls on hg38 (https://github.com/bcbio/bcbio-nextgen/issues/2982), so it will be even more useful.

Also, there was a preprint on missing variation in 641 genes https://www.biorxiv.org/content/10.1101/868570v1.full

In brief, bwa should be called with alt index to handle reads mapping to the main and alt reference properly. https://github.com/lh3/bwa/blob/master/README-alt.md

Could you please share your experience of validating variant calling hg38 with alt? @matthdsm? is it something important to push for general usage?

Sergey

matthdsm commented 4 years ago

Hi Sergey,

TBH, we haven't done much validation past GiaB and a bunch of in house samples with known variants. As the readme states, BWA-MEM is alt aware, so we didn't expect much issues there.

Since I'm working in a clinical context, much of what we (have to) look at is located on the canonical chromosomes and as such, hasn't given us much grief.

Cheers M

ohofmann commented 4 years ago

I am working on a writeup of the issues we're facing - particularly on how to handle CN issues of callers that are not alt-aware, and how to deal with variant calls that hit genes placed both on the main contig and on alts.

gabeng commented 4 years ago

Hi Sergey,

is bcbio actually calling variants on alt contigs? If yes, how would you interpret them? If you are excluding alt contigs from downstream analysis, how are you dealing with GSTT1? I noticed that the IonTorrent Suite added the GSTT1 alt contig to its hg38 reference, but it is exluding any other alt contigs.

What do you think about this (probably slightly outdated) tutorial for GATK alt contig calling? https://gatk.broadinstitute.org/hc/en-us/articles/360037498992

Regards, Ben

kokyriakidis commented 4 years ago

Maybe this approach is the solution in general?

https://www.sevenbridges.com/graph/

amizeranschi commented 4 years ago

It does seem that reference genomes could (soon?) be replaced by graph-based pan-genomes: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1774-4.

An open-source alternative to the SevenBridges solution is the VG toolkit: https://github.com/vgteam/vg (reference: https://www.nature.com/articles/nbt.4227).

naumenko-sa commented 4 years ago

Thanks everyone for the discussion and for the links provided!

Clarifying about ALT calling: there are several kinds of extra contigs in the hg38 reference, we may see them in genomes/Hsapiens/hg38/seq/hg38.dict:

Bcbio has options

Either of these options force bcbio to call variants only on chr1,2,...22,chrX,chrY,chrM. However, if you are not using these options, that does not mean you are calling variants on ALT contigs. Variants on the unplaced contigs are called but not on alt. It is a bit misleading - I'm updating the docs to reflect that.

BWA is alt aware, but it should be properly called (with alt index, as UKbiobank article suggests). We are not doing it for now.

So, overall, we are not supporting alt contig variant calling in bcbio for now, and there is some evidence in favour to improve it:

However, in clinical/clinical research variant calling (WES, WGS) and in cancer research with panels the absence of ALT support is not a huge issue, because the genes of interest and well-annotated genes are located on the main reference contigs.

Graph based references and population specific references are definitely the future, but given how slow GRCh37/hg38 transition is going, that could be a very remote future.

SN

gabeng commented 4 years ago

Sergey,

catching up with these developments and improving the documentation is so helpful. Thank you!

Ben

kokyriakidis commented 4 years ago

Maybe you should check this out: https://varlociraptor.github.io/ https://www.biorxiv.org/content/10.1101/741256v1.full

It seems very promising! It would be extremely nice to implement this!

ohofmann commented 4 years ago

SBJ00207 Germline vs COLO829 Blood

Apologies for taking so long to get back to this, I had to look at a number of additional samples to verify some of the differences we have seen between GRCh37/38 - and I am still not sure I am getting these right.

To recap, bcbio uses bwa-mem which is ALT-aware, and it ships the reference genome with all the required bells and whistles. As a result we are not prone to dropout issues like those seen by the UK Biobank where reads that mapped equally well to the primary genome build regions and their corresponding ALT regions got discarded as mutli-mappers.

That said, I think bcbio's current handling of ALT regions is not ideal and, depending on the use case, can result in key variant information being missed. Here is an IGV view of a patient WGS (SBJ00207, blood, top) and COLO829 WGS (blood sample only, bottom) alongside their bcbio germline ensemble calls based on Strelka2, VarDict and GATK Haplotype caller:

HLA-DRA region

This is part of the chr6 HLA region (chr6:32,439,927-32,479,928) and accordingly quite noisy, but we can still see a difference in variant call density towards the right of the browser view. Looking at one of the corresponding ALT region (chr6_GL000251v2_alt:3,878,084-3,918,084) for this section shows a different picture:

HLA-DRA region ALT

Based on the coverage distribution it seems clear that COLO829 has the haplotype represented by this particular ALT contig. SBJ00207 does not show the same coverage with just sporadic areas exhibiting reads that map equally well to the primary assembly and this ALT contig. This has a number of implications.

1. Missing ALT variant calls

As discussed in this thread bcbio does not call variants in ALT regions which means variants get missed. Here is an example downstream of HLA-DRA:

Missed ALT call

2. False positive germline SNV calls

What I did not appreciate before looking at the calls in more detail is that we are also generating lots of additional false positives in the (primary) HLA region for COLO829 precisely because reads that should only map to the ALT region are also present in the primary build - something we wanted to prevent in the first place by using ALT regions in the reference genome so ALT haloptype reads would not be forced to align to their main chromosome counterpart:

FP Calls

The explanation for this is in Heng Li's description of bwa-mem's ALT handling. With bcbio we are performing this first stage:

The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of an ALT hit is computed across all hits. If there are no non-ALT hits, the best ALT hit is outputted as the primary alignment. If there are both ALT and non-ALT hits, non-ALT hits will be primary and ALT hits be supplementary (SAM flag 0x800).

...but we are not performing the (Javascript-driven) postprocessing step which groups main and supplementary hits to adjust mapping qualities and decide on a primary placement. So we end up with all reads in this region having a supplementary alignment in either the ALT or primary region:

XA

Different variant callers handle the supplementary tag in different ways, but from the image above it looks like our default germline callers are happy to use them - they are not duplicate reads, they are not multi-mappers, and they have good mapping quality. The result is a high number of false positive germline calls for regions covered by ALTs in samples with a "non-standard" haplotype.

Note this is not limited to HLA and intergenic regions alone. We've seen this pattern for other gene regions that are also covered by (HLA) ALTs including MDC1, POU5F1, LTB, STK19 or NOTCH4, some of which have clinical significance.

3. Potential issues with CNV calls

The use of the supplementary tag without post-processing impacts other downstream tools as well. We noticed the problem in the first place because our CNV caller (PURPLE) flagged regions as het2hom - a loss of heterozygosity usually associated with a copy number loss, but we couldn't confirm this in our germline SNV calls nor did coverage of the affected region drop. Here is how PURPLE handles its pileup:

COBALT starts with the raw read counts per 1,000 base window for both normal and tumor samples by counting the number of alignment starts in the respective bam files with a mapping quality score of at least 10 that is neither unmapped, duplicated, secondary, nor supplementary. Windows with a GC content less than 0.2 or greater than 0.6 or with an average mappability below 0.85 are excluded from further analysis.

As a result we are getting contradicting feedback from the SNV and CNV pipelines with further complications in the somatic use case. This is causing problems for our (clinical!) curators as they are getting different information from the SNV and CNV callers, e.g., having a copy number of 5 for POU5F1 with an LOH flag which is not true.

Note, post-processing the BAM as recommended by bwa-mem is unlikely to solve this particular issue as PURPLE would likely get confused by the drop in coverage with 50% of the reads disappearing into an ALT region, but at least we could work with the PURPLE authors to make their CNV Caller ALT-aware.

4. Variant annotation concerns

Handling of ALT regions in bcbio aside: even with post-processing enabled and variant calls supported in ALT regions we would still have work to do. In particular, we would have to port variant annotations back to the primary assembly to distinguish between, say, two independent (heterozygous) changes in NOTCH4 or a homozyguous change at the same position, one in the primary assembly and one at an ALT. I am not sure whether this is best done based on genomic coordinates or with an initial focus on protein changes for a clinical setting.

chapmanb commented 4 years ago

Oliver; Thanks for the detailed investigation. Handling ALTs right is definitely hard and calling out these specific issues out is a big help. bcbio does have the bwa-mem post-processing integrated but currently only turns it on when an 'hlacaller' is set. This choice was due to a performance tradeoff when using the javascript post-processing: it can't keep up with the aligner on multicore machines so ends but being a bottleneck. If you try setting hlacaller: optitype it could be an easy way to confirm if the post-processing helps resolve the issues you're seeing in HLA and elsewhere.

I'd been hoping that alignment (and variant, and annotation) tooling would catch up with alts and we could move to something that handled this more correctly without having to implement separate algorithms in bcbio. I'm getting out of the loop on the latest tools but might be worth investigating to see if any aligners handle these better and can project back to the reference coordinates. This was meant to be the sweet spot for vg but maybe there are other methods handling this better.

ohofmann commented 4 years ago

Brad, yep, completely agree. This isn't meant as a bug report but more to document issues we stumbled into to helpfully help others make sense of some of the odd results. We are looking into DRAGEN which has a different strategy to place reads between primary and ALTs but I think we'd run into similar issues with non-ALT aware callers. I am also running the same samples with post-processing enables and will report back.

naumenko-sa commented 4 years ago

Thanks Oliver, thanks Brad!

I did not realize that hlacaller: optitype affects alignment and variant calling on all chr, not just HLA regions, good to know! It might also help in the other issue where we are profiling bcbio performance. HLA typing is sometimes in the configs 'just to have it', and could significantly decrease performance, especially on AWS!

Sergey

naumenko-sa commented 4 years ago

upd: after a lot of testing it was shown that turning hlacaller: optitype alters the alignment (all alignment goes through bwa_alt.js script, it increases supplementary alignments somethis +30%, alters mapping scores, and creates false positives. This is for high coverage panels. So for now that hla typing should be done in a separate run from somatic variant calling. In a future release these two will be separated.

todo: