Closed vladsavelyev closed 2 years ago
Great question! The reason this line uses sex_stats["xx"].y.mean
is to correctly infer all chromosome Y (chrY) karyotypes, including karyotypes where a sample has no Y chromosome. This value is used twice: https://github.com/broadinstitute/gnomad_methods/blob/master/gnomad/sample_qc/sex.py#L164 and https://github.com/broadinstitute/gnomad_methods/blob/master/gnomad/sample_qc/sex.py#L167.
sex_stats["xx"].y.mean
. This is to check whether a sample does not have a single Y chromosome (catching X, XX, XXX karyotypes, for example).sex_stats["xx"].y.mean
as the lower chrY ploidy cutoff (and this part of the case statement checks for whether a sample has a single Y chromosome).Also, as a quick note, the ploidy cutoffs from this function should be double checked and tweaked where necessary; this function only provides a guideline for what the cutoffs could be.
Would it be helpful if I added some of these notes into the docstring for get_ploidy_cutoffs
?
That makes sense! Thanks a lot for the explanation.
I think I understood now why the thresholds din't work in my case.
For the context: I'm re-analysing samples that were previously aligned with BWA-MEM and called with GATK 3.5, but now re-aligned with DRAGMAP and DRAGEN-GATK.
For the same female samples, instead of expected XX that I expectedly got before, I'm now getting XXY. The function infers chrY ploidy as 1x, and the coverage that it calculates for chrY to be around 20x, even though almost all variants and reference blocks have DP close to zero.
I looked closer into data for 2 samples with this problem, and found out that it's just one or two very long reference blocks that contribute to the coverage, making it ~20x:
In one sample:
chrY 15882980 . A <NON_REF> . DRAGENHardQUAL END=16635242 GT:DP:GQ:MIN_DP:PL 0/0:556:0:0:0,0,0
In another sample - similar region:
chrY 15363283 . C <NON_REF> . DRAGENHardQUAL END=15886436 GT:DP:GQ:MIN_DP:PL 0/0:712:0:0:0,0,0
chrY 15886743 . A <NON_REF> . DRAGENHardQUAL END=16027941 GT:DP:GQ:MIN_DP:PL 0/0:697:0:0:0,0,0
I looked into this region in IGV, and found a stretch of 62 Ts attracting ~20k reads:
(For comparison, BWA-MEM version of this sample is also added into IGV, and doesn't have anything mapped there)
It got smoothed out in those long blocks in GVCF likely because of a very coarse quality binning scheme we used for HaplotypeCaller (just a single -GQB
argument: gatk HaplotypeCaller -GQB 20
). So very long regions like chrY:15363283-15886436 got DP averaged out to 712x just because of one or a few repetitive regions like this with coverage >20k.
I'm not sure yet what could be the best general approach here. I can file this issue to DRAGMAP repo, but wondering if there is a chance that it might happen with BWA-MEM on other data, where some tiny region attracts tons of reads affecting the average DP in calculations. Good news is that those reads have a very low MQ, so using some MQ-adjusted DP could help. Though the ref blocks GVCF annotations don't have this information, there is just an overall DP ("Approximate read depth (reads with MQ=255 or with bad mates are filtered)"
) and MIN_DP ("Minimum DP observed within the GVCF block"
).
Let me know if you have ideas!
I understand why the poly-T in the reference is so "attractive", but I don't understand why the mates of those reads don't disqualify them from that locus. I wonder if DragMap has different logic regarding mapping of mates. I'm very concerned about this behavior. Can you get more info on a few of the mates? I believe you can do it from the read details in IGV.
In most regions, DP should correlate really well with GQ, so I would expect 712X to be higher than GQ20, and it would be in a different block from the no coverage GQ0s. I would have suggested MIN_DP, expect the MIN_DP for GQ0 will probably be zero more often than not. We can do pretty reliable Y ploidy inference from exomes, so restricting to some high quality and/or well behaved intervals will probably still yield enough data. In the gs://gcp-public-data--broad-references (in hg38/v0/ or something like that) there should be an interval_list called exome_evaluation_regions
, which would be a good place to start for high quality intervals.
I looked at a bunched of reads mapping to that locus, and they all have a mate mapping somewhere else in the genome (different every time) with a higher MQ, and also often a supplementary alignment mapping somewhere else again (differently from the mate.)
Made screenshots of 3 arbitrary reads:
Read:
A00488:117:HVTVGDSXY:1:2606:1099:33051 113 chrY 15896766 0 3S31M1D29M87S chr20 61410542 0 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTGTTTTTTTTTTTTATTTTGTTTTTTTTTTTTGTTTTTGTTTGGATGTTTGTATTGTTGGTTTTTAGTGTTGGATTTTTTGTAGATGTATGTTTTTGTAGTTTGATT ???????+??????+???+???????+??+???+????++????+?????+?+?+??+???++?+???+?++?+????????????????????5???????????????????????+???+???????'?5?+??+?????+?????? SA:Z:chr13,104356596,+,65H37M48H,0,1; MC:Z:150M MQ:i:19 AS:i:48 XS:i:46 mc:i:61410691 ms:i:4764 MD:Z:31^T23T5 NM:i:2 RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
His mate:
A00488:117:HVTVGDSXY:1:2606:1099:33051 177 chr20 61410542 19 150M chrY 15896766 0 TACATTCCACTCACTGGTCCATCTATCCATCCATTCATCCATCTGTCAATCATCCATCTACACATCCCTCCATCTTCCATTCCTCCCACTGGTCCATCCATCCATCCATCCATCCATCCTCTCACTTGTTAGTCCATTGACCTGTCCATC ?+??++??+?+?+?+++??+???????????+???+????+?+?+????+????????5????????????????????????????+?+?5?????????????????????????????????????????????????????????? MC:Z:3S31M1D29M87S MQ:i:0 AS:i:140 XS:i:118 mc:i:15896913 ms:i:4360 MD:Z:1C6T141 NM:i:2 RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Read:
A00488:117:HVTVGDSXY:1:1315:24108:27211 177 chrY 15896766 2 59S60M31S chr11 77041117 0 GGGGGGTTGTGTGTTGGTGGGGTTTTTTTTTTTTTGTTTTTGGGGGGTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTGTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAACCCCCCCT +?+?++++++???+++5+++++??+??????????++?++5+??+++++++?++?+??+??+???????????????+??++++++++?++??++?++???++??+?+++???+?++?+??????++'+???+???5??55+???????? SA:Z:chr12,50924517,+,9H39M102H,0,0; MC:Z:150M MQ:i:60 AS:i:50 XS:i:45 mc:i:77041266 ms:i:5526 MD:Z:24T17G17 NM:i:2 RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Mate:
A00488:117:HVTVGDSXY:1:1315:24108:27211 113 chr11 77041117 60 150M chrY 15896766 0 CAGCGCCCCCTGCCATGTGCCTACAAATGTCAGTTGTGATTTCCACTGTTTACAAGTGAGTGGAGCTGGAGCTGGGCTGACAGTGTCAGGTGGATCCCGCTTCCCCCTCCCCCAAGAAGTCAGCCAACACGCAGCTGAGGCGCATGTGGT ?????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:59S60M31S MQ:i:2 AS:i:145 XS:i:20 mc:i:32866959 ms:i:2626 MD:Z:84A65 NM:i:1 RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Read:
A00488:117:HVTVGDSXY:4:1501:9281:3443 145 chrY 15896766 0 25S61M64S chrX 114564847 0 TTTACTTTTTTCTTTTTTTTCTTCTTTTTTTTTTTTTTTTTTATTTTTATTTTTTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTCTTTTTTCGTTTTTGCGTTCGTCTGCCCCCTCCTCCCCCCCCCCCCCCCCCCCCCCCCC ??'++?++++++?++++++++?++++?+?+++++++?++++'+++++5+???++?+?+++?+++?+++???+?+++???++?+??+++++++++++++%+++++55%+?+%++++5++++++++++++++++++++++???????????? SA:Z:chr5,112769033,-,6H36M108H,0,0; MC:Z:151M MQ:i:60 AS:i:41 XS:i:40 mc:i:114564847 ms:i:5563 MD:Z:17T5T6G11G18 NM:i:4 RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Mate:
A00488:117:HVTVGDSXY:4:1501:9281:3443 97 chrX 114564847 60 151M chrY 15896766 0 TTAACCGACACACAGTCAGATTTTAAGCAGTAACCATTTGAAAGGTCACTTTAGAAGCAATGTGGAGAAGTAACTTGAAGAGGTGGGACTGGAAGCAAGAAGAAGTGCTTGGACACTAAAGATGTAATATGCCAGGTGAAGGAGAACATGT ??????5???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:25S61M64S MQ:i:0 AS:i:151 XS:i:19 mc:i:172218716 ms:i:1216 MD:Z:151 NM:i:0 RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
In BWA, actually, the region looks very similar:
The GVCFs records resulting from these alignments, however, are different (still low-GQ block, but called variants split it into 6):
bcftools view -H -r chrY:15896761-15896857 DRAGMAP.g.vcf.gz
chrY 15886743 . A <NON_REF> . DRAGENHardQUAL END=16027941 GT:DP:GQ:MIN_DP:PL 0/0:697:0:0:0,0,0
bcftools view -H -r chrY:15896761-15896857 BWAMEM.g.vcf.gz
chrY 15883027 . A <NON_REF> . . END=15896771 GT:DP:GQ:MIN_DP:PL 0/0:6:0:0:0,0,0
chrY 15896772 . T A,<NON_REF> 0.01 . . GT ./.
chrY 15896773 . T G,<NON_REF> 0.01 . . GT ./.
chrY 15896774 . T <NON_REF> . . END=15896774 GT:DP:GQ:MIN_DP:PL 0/0:0:0:0:0,0,0
chrY 15896775 . T G,<NON_REF> 0.01 . . GT ./.
chrY 15896776 . T <NON_REF> . . END=16027941 GT:DP:GQ:MIN_DP:PL 0/0:1:0:0:0,0,0
In most regions, DP should correlate really well with GQ
Shouldn't GQ also depend on MQ of reads? DP would include low-MQ reads:
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
Resulting in a high DP; However GQ would remain small (as in the poly-T region).
We can do pretty reliable Y ploidy inference from exomes, so restricting to some high quality and/or well behaved intervals will probably still yield enough data
Thanks!!
I worked around this issue by removing regions with a very high coverage (10x times the normalization contig coverage):
mt = mt.filter_entries(mt.DP < 10 * mt.chr20_mean_dp)
And then also recalculating the denominator as the sum of sizes of all unfiltered regions (which however wouldn't work well with excluded_calling_intervals
/included_calling_intervals
):
f"{chrom}_blocks_total_size": hl.agg.sum(
hl.cond(
chr_mt.LGT.is_hom_ref(),
1 + chr_mt.END - chr_mt.locus.position,
1,
)
)
It solves the problem. However, your suggestion to whitelist/blacklist regions will be a better approach, as it wouldn't require modifying the gnomad methods codebase. I might take default genome regions, and remove known repeats using a UCSC mask or something.
The BWA-derived GVCF makes more sense to me. Yes, there are GQ0 blocks and calls, but the depth on the ref blocks is very low, which is what I would expect for chrY on a female.
We're seen a lot of discrepancies between DP in comparing DRAGEN with GATK (4.1 and onwards, I believe). Most of these seem to revolve around differences in how low mapping qualities are treated, so I think using regions of "high confidence" would be a robust way to move forward.
Thanks a lot for your help!
And really thanks for publishing the gnomad package, can't overestimate how useful it was for us so far
As the docstring says, the first value of
y_ploidy_cutoffs
should be "lower cutoff for single Y". Should it be changed in the code as follows?https://github.com/broadinstitute/gnomad_methods/blob/35066ffc01d63ac2d7b20e069ea6703013ae9da7/gnomad/sample_qc/sex.py#L117-L118