Sydney-Informatics-Hub / Fastq-to-BAM

Optimised pipeline to process whole genome sequence data from fastq to BAM on NCI Gadi
GNU General Public License v3.0
19 stars 7 forks source link

Alternate Contig Alignments and Mapping Quality Zero #15

Open DarioS opened 2 years ago

DarioS commented 2 years ago

The majority of reads (particularly over exons) in a region of chr17 appear to have MAPQ being zero.

KANSL1alignments

Copying and pasting one of their sequences into UCSC BLAT, it becomes apparent that they also map perfectly to ALT contigs.

QUERY  SCORE START   END QSIZE IDENTITY  CHROM                 STRAND  START       END   SPAN
-------------------------------------------------------------------------------------------------------------
KANSL1   151     1   151   151   100.0%  chr17_GL000258v2_alt  -      721003    721153    151
KANSL1   151     1   151   151   100.0%  chr17_KI270908v1_alt  +      769645    769795    151
KANSL1   151     1   151   151   100.0%  chr17                 +    46067546  46067696    151

But, the MAPQ there is also all zero (reads are white, not dark grey).

image

Isn't bwa-postalt.js supposed to lift-over these alignments to the main chromosomes and increase their MAPQ?

It reads all potential hits reported in the XA tag, lifts ALT hits to the chromosomal positions using the ALT-to-ref alignment, groups them based on overlaps between their lifted positions, and then re-estimates mapQ across the best scoring hit in each group.

It looks like SNVs in hg38 regions with alternate contigs might always be missed. @r0sies @tracychew can you investigate?

calizilla commented 2 years ago

Hi Dario

I have checked your data and can see that alt-aware mapping was carried out. You have the .alt index file in place, the alt contigs have the AH tag present in the SAM header indicating alt-aware handling, and you have reads with pa tags indicating alt aware handling.

I have checked the alt contig in question and its corresponding location on chr17 primary reference sequence, and there are primary reads with non-zero mapping quality that also map to the alt contig.

Have a look at section 3 here https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/tutorials/8017-how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38 for an example of reads that map to both the primary and alt contig, and a description of the PA tag.

It looks like SNVs in hg38 regions with alternate contigs might always be missed

No this is not correct. You can verify this by counting the number of SNVs in your final VCF in a hg38 primary region with alt contig.

DarioS commented 2 years ago

I still think that this is a real issue. Let's use ST-E00127:578:HJKY7ALXX:4:2106:25723:5089 from CSCC_0004-M1.final.bam as an example. It has three perfect alignments; chr17 and two alternate contigs. You can quickly verify its hg38 alignments using BLAT.

Read sequence

AGCTCAGTACAGGACGTGTCCGGGCTGCCACACAGGTGCCATCAGATGATGAAGAGACGAGATTCAGTCGTTGCTTCTTATGTAATTGTTCCTCAGCATCAGAGCTGTCACCTGGAATGTGGTCTGCCAAGACAGGCTGAAGACTATACAA

Alignments of read to genome

image

It seems to be lifted back to chr17, but it has a MAPQ of 0 on both chr17 and the alt. So, it wouldn't be used for SNV calling nor copy number.

Compare

Read name = ST-E00127:578:HJKY7ALXX:4:2106:25723:5089
Sample = CSCC_0004-M1
Library = CSCC_0004-M1_1
Read group = HJKY7ALXX.2_CSCC_0004-M1_1
Read length = 151bp
Flags = 147
----------------------
Mapping = Primary @ MAPQ 0
Reference span = chr17:46,066,590-46,066,740 (-) = 151bp
Cigar = 151M

to

Read name = ST-E00127:578:HJKY7ALXX:4:2106:25723:5089
Sample = CSCC_0004-M1
Library = CSCC_0004-M1_1
Read group = HJKY7ALXX.2_CSCC_0004-M1_1
Read length = 151bp
Flags = 2195
----------------------
Mapping = Supplementary @ MAPQ 0
Reference span = chr17_KI270908v1_alt:768,689-768,839 (-) = 151bp
Cigar = 151M

It's missing from the third perfect mapping location on chr17_GL000258v2_alt.

We have found problems with copy number variant calling and there would also be a problem with SNVs here in cancer samples which had SNVs in a gene such as KANSL1. I think this explains the enrichment of MAPQ 0 reads over exons (exons are more conserved than introns and more likely to result in score ties) and MAPQ > 0 reads enriched in introns. But, reads like the one above would be fine if hg19 or a transcriptome FASTA reference was used. Do you agree that it is a problem?

calizilla commented 2 years ago

It does appear odd that the mapping quality of the primary alignment has not been assigned > 0. I am also interested to note that both alt locations are listed in the XA tag, but only the alignment to chr17_KI270908v1_alt is present in the BAM file. I would suggest reporting this to https://github.com/lh3/bwa/tree/master/bwakit for the developer's comment.

DarioS commented 2 years ago

O.K. issue reported to Heng Li.

georgiesamaha commented 2 years ago

Hi Dario, While we wait to hear back from the developers, we suggest taking a closer look at other potential regions in the BAMs that have an elevated number of reads with post-alt processing tags and MAPQ 0, if you are interested. You can then perform variant calling separately on these regions.

@calliza suggests doing this by extracting the reads that map to the chr17 region from the BAM file, convert them back to fastq format and remap them using bwa-mem without post-alt processing.

Let us know how you go, we're keen to understand the impact this issue might have.

DarioS commented 2 years ago

I tried out the suggestion with GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz (as recommended by Heng Li) and now all of the reads have high mapping quality (i.e. MAPQ 60) across the entire region. So, the MAPQ 0 problem seems to be caused simply by having alternate contigs (with identical DNA stretches) in the reference and bwa-postalt.js seemingly being buggy. image

calizilla commented 2 years ago

Let's wait to hear what Heng Li has to say regarding https://github.com/lh3/bwa/issues/342 before assuming the bwa-kit is buggy. There may be a sound reason as to why the MAPQ of these reads were not lifted during post-alt processing.

DarioS commented 1 year ago

Could create_project.bash offer an easy way to switch between reference genomes with and without alternate contigs? I am working on a new project where accurate copy numbers are a higher priority than HLA typing, for instance.

calizilla commented 1 year ago

Thanks for the suggestion Dario. The simplest solution would be to clone a new repo and re-align using the different reference genome. As the mapping qualities and tags will be different for reads that have aligned to 'alt', simply removing the alt component of the refseq won't do the trick. Back-tracking to calculate those values without alt contigs is beyond the scope of this pipeline, but sounds like a useful utility to have. Perhaps add it to the bwa-kit github as a suggestion?

DarioS commented 12 months ago

Heng Li now recommends that people do not use hs38DH.

Name | Coordinate | Brief description -- | -- | -- hs37 | GRCh37 | 1000 Genomes Project (1000G) reference in 2010 hs37d5 | GRCh37 | hs37 with decoy (recommended GRCh37) hs38 | GRCh38 | GRCh38 no-alt analysis set (recommended GRCh38) hs38DH | GRCh38 | GRCh38 with ALT, decoy and HLA genes (not recommended) chm13v2 | CHM13v2 | T2T CHM13 plus HG002 chrY (recommended CHM13)

hs38DH: This genome includes GRCh38 alternate contigs, GRCh38 decoy sequences and HLA alleles. As GRCh38 is more complete than GRCh37, GRCh38 decoy sequences are not as important as GRCh37 decoy. Furthermore, most tools would not work well with this version as they are not ALT-aware. Improper use of this genome would hurt variant sensitivity.

calizilla commented 11 months ago

Hi Dario. Heng Li has recomemnded against this genome ONLY IF the user is not using ALT-aware tools. this repository contains a workflow that uses an alt-aware tool (bwa-kit) so use of hs38DH is OK.

We have not forgotten about this issue, and do have plans to investigate further. I apologise its taking so long, but as you can understand we have a large workload with a small team.

I will let you know once we have an answer for you regarding the original issue.

DarioS commented 11 months ago

Excellent. There is certainly a bioinformatics staff shortage everywhere.