Open DarioS opened 2 years ago
Hi Dario
I have checked your data and can see that alt-aware mapping was carried out. You have the .alt index file in place, the alt contigs have the AH tag present in the SAM header indicating alt-aware handling, and you have reads with pa tags indicating alt aware handling.
I have checked the alt contig in question and its corresponding location on chr17 primary reference sequence, and there are primary reads with non-zero mapping quality that also map to the alt contig.
Have a look at section 3 here https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/tutorials/8017-how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38 for an example of reads that map to both the primary and alt contig, and a description of the PA tag.
It looks like SNVs in hg38 regions with alternate contigs might always be missed
No this is not correct. You can verify this by counting the number of SNVs in your final VCF in a hg38 primary region with alt contig.
I still think that this is a real issue. Let's use ST-E00127:578:HJKY7ALXX:4:2106:25723:5089 from CSCC_0004-M1.final.bam as an example. It has three perfect alignments; chr17 and two alternate contigs. You can quickly verify its hg38 alignments using BLAT.
Read sequence
AGCTCAGTACAGGACGTGTCCGGGCTGCCACACAGGTGCCATCAGATGATGAAGAGACGAGATTCAGTCGTTGCTTCTTATGTAATTGTTCCTCAGCATCAGAGCTGTCACCTGGAATGTGGTCTGCCAAGACAGGCTGAAGACTATACAA
Alignments of read to genome
It seems to be lifted back to chr17, but it has a MAPQ of 0 on both chr17 and the alt. So, it wouldn't be used for SNV calling nor copy number.
Compare
Read name = ST-E00127:578:HJKY7ALXX:4:2106:25723:5089
Sample = CSCC_0004-M1
Library = CSCC_0004-M1_1
Read group = HJKY7ALXX.2_CSCC_0004-M1_1
Read length = 151bp
Flags = 147
----------------------
Mapping = Primary @ MAPQ 0
Reference span = chr17:46,066,590-46,066,740 (-) = 151bp
Cigar = 151M
to
Read name = ST-E00127:578:HJKY7ALXX:4:2106:25723:5089
Sample = CSCC_0004-M1
Library = CSCC_0004-M1_1
Read group = HJKY7ALXX.2_CSCC_0004-M1_1
Read length = 151bp
Flags = 2195
----------------------
Mapping = Supplementary @ MAPQ 0
Reference span = chr17_KI270908v1_alt:768,689-768,839 (-) = 151bp
Cigar = 151M
It's missing from the third perfect mapping location on chr17_GL000258v2_alt.
We have found problems with copy number variant calling and there would also be a problem with SNVs here in cancer samples which had SNVs in a gene such as KANSL1. I think this explains the enrichment of MAPQ 0 reads over exons (exons are more conserved than introns and more likely to result in score ties) and MAPQ > 0 reads enriched in introns. But, reads like the one above would be fine if hg19 or a transcriptome FASTA reference was used. Do you agree that it is a problem?
It does appear odd that the mapping quality of the primary alignment has not been assigned > 0. I am also interested to note that both alt locations are listed in the XA tag, but only the alignment to chr17_KI270908v1_alt is present in the BAM file. I would suggest reporting this to https://github.com/lh3/bwa/tree/master/bwakit for the developer's comment.
Hi Dario, While we wait to hear back from the developers, we suggest taking a closer look at other potential regions in the BAMs that have an elevated number of reads with post-alt processing tags and MAPQ 0, if you are interested. You can then perform variant calling separately on these regions.
@calliza suggests doing this by extracting the reads that map to the chr17 region from the BAM file, convert them back to fastq format and remap them using bwa-mem without post-alt processing.
Let us know how you go, we're keen to understand the impact this issue might have.
I tried out the suggestion with GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz (as recommended by Heng Li) and now all of the reads have high mapping quality (i.e. MAPQ 60) across the entire region. So, the MAPQ 0 problem seems to be caused simply by having alternate contigs (with identical DNA stretches) in the reference and bwa-postalt.js seemingly being buggy.
Let's wait to hear what Heng Li has to say regarding https://github.com/lh3/bwa/issues/342 before assuming the bwa-kit is buggy. There may be a sound reason as to why the MAPQ of these reads were not lifted during post-alt processing.
Could create_project.bash
offer an easy way to switch between reference genomes with and without alternate contigs? I am working on a new project where accurate copy numbers are a higher priority than HLA typing, for instance.
Thanks for the suggestion Dario. The simplest solution would be to clone a new repo and re-align using the different reference genome. As the mapping qualities and tags will be different for reads that have aligned to 'alt', simply removing the alt component of the refseq won't do the trick. Back-tracking to calculate those values without alt contigs is beyond the scope of this pipeline, but sounds like a useful utility to have. Perhaps add it to the bwa-kit github as a suggestion?
Heng Li now recommends that people do not use hs38DH.
Name | Coordinate | Brief description -- | -- | -- hs37 | GRCh37 | 1000 Genomes Project (1000G) reference in 2010 hs37d5 | GRCh37 | hs37 with decoy (recommended GRCh37) hs38 | GRCh38 | GRCh38 no-alt analysis set (recommended GRCh38) hs38DH | GRCh38 | GRCh38 with ALT, decoy and HLA genes (not recommended) chm13v2 | CHM13v2 | T2T CHM13 plus HG002 chrY (recommended CHM13)hs38DH: This genome includes GRCh38 alternate contigs, GRCh38 decoy sequences and HLA alleles. As GRCh38 is more complete than GRCh37, GRCh38 decoy sequences are not as important as GRCh37 decoy. Furthermore, most tools would not work well with this version as they are not ALT-aware. Improper use of this genome would hurt variant sensitivity.
Hi Dario. Heng Li has recomemnded against this genome ONLY IF the user is not using ALT-aware tools. this repository contains a workflow that uses an alt-aware tool (bwa-kit) so use of hs38DH is OK.
We have not forgotten about this issue, and do have plans to investigate further. I apologise its taking so long, but as you can understand we have a large workload with a small team.
I will let you know once we have an answer for you regarding the original issue.
Excellent. There is certainly a bioinformatics staff shortage everywhere.
The majority of reads (particularly over exons) in a region of chr17 appear to have MAPQ being zero.
Copying and pasting one of their sequences into UCSC BLAT, it becomes apparent that they also map perfectly to ALT contigs.
But, the MAPQ there is also all zero (reads are white, not dark grey).
Isn't
bwa-postalt.js
supposed to lift-over these alignments to the main chromosomes and increase their MAPQ?It looks like SNVs in hg38 regions with alternate contigs might always be missed. @r0sies @tracychew can you investigate?