broadinstitute / gnomad_methods

Hail helper functions for the gnomAD project and Translational Genomics Group
https://gnomad.broadinstitute.org
MIT License
87 stars 27 forks source link

Sex plodiy cutoffs: lower cutoff for single Y? #428

Closed vladsavelyev closed 2 years ago

vladsavelyev commented 2 years ago

As the docstring says, the first value of y_ploidy_cutoffs should be "lower cutoff for single Y". Should it be changed in the code as follows?

- sex_stats["xx"].y.mean
-                + (normal_ploidy_cutoff * sex_stats["xx"].y.stdev),
+ sex_stats["xy"].y.mean
+                - (normal_ploidy_cutoff * sex_stats["xy"].y.stdev),

https://github.com/broadinstitute/gnomad_methods/blob/35066ffc01d63ac2d7b20e069ea6703013ae9da7/gnomad/sample_qc/sex.py#L117-L118

ch-kr commented 2 years ago

Great question! The reason this line uses sex_stats["xx"].y.mean is to correctly infer all chromosome Y (chrY) karyotypes, including karyotypes where a sample has no Y chromosome. This value is used twice: https://github.com/broadinstitute/gnomad_methods/blob/master/gnomad/sample_qc/sex.py#L164 and https://github.com/broadinstitute/gnomad_methods/blob/master/gnomad/sample_qc/sex.py#L167.

Also, as a quick note, the ploidy cutoffs from this function should be double checked and tweaked where necessary; this function only provides a guideline for what the cutoffs could be.

Would it be helpful if I added some of these notes into the docstring for get_ploidy_cutoffs?

vladsavelyev commented 2 years ago

That makes sense! Thanks a lot for the explanation.

I think I understood now why the thresholds din't work in my case.

For the context: I'm re-analysing samples that were previously aligned with BWA-MEM and called with GATK 3.5, but now re-aligned with DRAGMAP and DRAGEN-GATK.

For the same female samples, instead of expected XX that I expectedly got before, I'm now getting XXY. The function infers chrY ploidy as 1x, and the coverage that it calculates for chrY to be around 20x, even though almost all variants and reference blocks have DP close to zero.

I looked closer into data for 2 samples with this problem, and found out that it's just one or two very long reference blocks that contribute to the coverage, making it ~20x:

In one sample:

chrY 15882980 . A <NON_REF> . DRAGENHardQUAL END=16635242 GT:DP:GQ:MIN_DP:PL 0/0:556:0:0:0,0,0

In another sample - similar region:

chrY 15363283 . C <NON_REF> . DRAGENHardQUAL END=15886436 GT:DP:GQ:MIN_DP:PL 0/0:712:0:0:0,0,0
chrY 15886743 . A <NON_REF> . DRAGENHardQUAL END=16027941 GT:DP:GQ:MIN_DP:PL 0/0:697:0:0:0,0,0

I looked into this region in IGV, and found a stretch of 62 Ts attracting ~20k reads:

Screen Shot 2022-01-10 at 9 33 01 pm

(For comparison, BWA-MEM version of this sample is also added into IGV, and doesn't have anything mapped there)

It got smoothed out in those long blocks in GVCF likely because of a very coarse quality binning scheme we used for HaplotypeCaller (just a single -GQB argument: gatk HaplotypeCaller -GQB 20). So very long regions like chrY:15363283-15886436 got DP averaged out to 712x just because of one or a few repetitive regions like this with coverage >20k.

I'm not sure yet what could be the best general approach here. I can file this issue to DRAGMAP repo, but wondering if there is a chance that it might happen with BWA-MEM on other data, where some tiny region attracts tons of reads affecting the average DP in calculations. Good news is that those reads have a very low MQ, so using some MQ-adjusted DP could help. Though the ref blocks GVCF annotations don't have this information, there is just an overall DP ("Approximate read depth (reads with MQ=255 or with bad mates are filtered)") and MIN_DP ("Minimum DP observed within the GVCF block").

Let me know if you have ideas!

ldgauthier commented 2 years ago

I understand why the poly-T in the reference is so "attractive", but I don't understand why the mates of those reads don't disqualify them from that locus. I wonder if DragMap has different logic regarding mapping of mates. I'm very concerned about this behavior. Can you get more info on a few of the mates? I believe you can do it from the read details in IGV.

In most regions, DP should correlate really well with GQ, so I would expect 712X to be higher than GQ20, and it would be in a different block from the no coverage GQ0s. I would have suggested MIN_DP, expect the MIN_DP for GQ0 will probably be zero more often than not. We can do pretty reliable Y ploidy inference from exomes, so restricting to some high quality and/or well behaved intervals will probably still yield enough data. In the gs://gcp-public-data--broad-references (in hg38/v0/ or something like that) there should be an interval_list called exome_evaluation_regions, which would be a good place to start for high quality intervals.

vladsavelyev commented 2 years ago

I looked at a bunched of reads mapping to that locus, and they all have a mate mapping somewhere else in the genome (different every time) with a higher MQ, and also often a supplementary alignment mapping somewhere else again (differently from the mate.)

Made screenshots of 3 arbitrary reads:

Read:

A00488:117:HVTVGDSXY:1:2606:1099:33051  113 chrY    15896766    0   3S31M1D29M87S   chr20   61410542    0   TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTGTTTTTTTTTTTTATTTTGTTTTTTTTTTTTGTTTTTGTTTGGATGTTTGTATTGTTGGTTTTTAGTGTTGGATTTTTTGTAGATGTATGTTTTTGTAGTTTGATT  ???????+??????+???+???????+??+???+????++????+?????+?+?+??+???++?+???+?++?+????????????????????5???????????????????????+???+???????'?5?+??+?????+??????  SA:Z:chr13,104356596,+,65H37M48H,0,1;   MC:Z:150M   MQ:i:19 AS:i:48 XS:i:46 mc:i:61410691   ms:i:4764   MD:Z:31^T23T5   NM:i:2  RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752

His mate:

A00488:117:HVTVGDSXY:1:2606:1099:33051  177 chr20   61410542    19  150M    chrY    15896766    0   TACATTCCACTCACTGGTCCATCTATCCATCCATTCATCCATCTGTCAATCATCCATCTACACATCCCTCCATCTTCCATTCCTCCCACTGGTCCATCCATCCATCCATCCATCCATCCTCTCACTTGTTAGTCCATTGACCTGTCCATC  ?+??++??+?+?+?+++??+???????????+???+????+?+?+????+????????5????????????????????????????+?+?5??????????????????????????????????????????????????????????  MC:Z:3S31M1D29M87S  MQ:i:0  AS:i:140    XS:i:118    mc:i:15896913   ms:i:4360   MD:Z:1C6T141    NM:i:2  RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Screen Shot 2022-01-12 at 4 59 00 pm

Read:

A00488:117:HVTVGDSXY:1:1315:24108:27211 177 chrY    15896766    2   59S60M31S   chr11   77041117    0   GGGGGGTTGTGTGTTGGTGGGGTTTTTTTTTTTTTGTTTTTGGGGGGTTTGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTGTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTAAAAAAAAAAAAAACCCCCCCT  +?+?++++++???+++5+++++??+??????????++?++5+??+++++++?++?+??+??+???????????????+??++++++++?++??++?++???++??+?+++???+?++?+??????++'+???+???5??55+????????  SA:Z:chr12,50924517,+,9H39M102H,0,0;    MC:Z:150M   MQ:i:60 AS:i:50 XS:i:45 mc:i:77041266   ms:i:5526   MD:Z:24T17G17   NM:i:2  RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752

Mate:

A00488:117:HVTVGDSXY:1:1315:24108:27211 113 chr11   77041117    60  150M    chrY    15896766    0   CAGCGCCCCCTGCCATGTGCCTACAAATGTCAGTTGTGATTTCCACTGTTTACAAGTGAGTGGAGCTGGAGCTGGGCTGACAGTGTCAGGTGGATCCCGCTTCCCCCTCCCCCAAGAAGTCAGCCAACACGCAGCTGAGGCGCATGTGGT  ??????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????  MC:Z:59S60M31S  MQ:i:2  AS:i:145    XS:i:20 mc:i:32866959   ms:i:2626   MD:Z:84A65  NM:i:1  RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Screen Shot 2022-01-12 at 5 07 09 pm

Read:

A00488:117:HVTVGDSXY:4:1501:9281:3443   145 chrY    15896766    0   25S61M64S   chrX    114564847   0   TTTACTTTTTTCTTTTTTTTCTTCTTTTTTTTTTTTTTTTTTATTTTTATTTTTTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTCTTTTTTCGTTTTTGCGTTCGTCTGCCCCCTCCTCCCCCCCCCCCCCCCCCCCCCCCCC  ??'++?++++++?++++++++?++++?+?+++++++?++++'+++++5+???++?+?+++?+++?+++???+?+++???++?+??+++++++++++++%+++++55%+?+%++++5++++++++++++++++++++++????????????  SA:Z:chr5,112769033,-,6H36M108H,0,0;    MC:Z:151M   MQ:i:60 AS:i:41 XS:i:40 mc:i:114564847  ms:i:5563   MD:Z:17T5T6G11G18   NM:i:4  RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752

Mate:

A00488:117:HVTVGDSXY:4:1501:9281:3443   97  chrX    114564847   60  151M    chrY    15896766    0   TTAACCGACACACAGTCAGATTTTAAGCAGTAACCATTTGAAAGGTCACTTTAGAAGCAATGTGGAGAAGTAACTTGAAGAGGTGGGACTGGAAGCAAGAAGAAGTGCTTGGACACTAAAGATGTAATATGCCAGGTGAAGGAGAACATGT ??????5???????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????? MC:Z:25S61M64S  MQ:i:0  AS:i:151    XS:i:19 mc:i:172218716  ms:i:1216   MD:Z:151    NM:i:0  RG:Z:HVTVGDSXY.1-2-3-4.210316_FD09252752
Screen Shot 2022-01-12 at 5 13 55 pm

In BWA, actually, the region looks very similar:

Screen Shot 2022-01-12 at 5 39 14 pm

The GVCFs records resulting from these alignments, however, are different (still low-GQ block, but called variants split it into 6):

bcftools view -H -r chrY:15896761-15896857 DRAGMAP.g.vcf.gz
chrY    15886743    .   A   <NON_REF>   .   DRAGENHardQUAL  END=16027941    GT:DP:GQ:MIN_DP:PL  0/0:697:0:0:0,0,0

bcftools view -H -r chrY:15896761-15896857 BWAMEM.g.vcf.gz 
chrY    15883027    .   A   <NON_REF>   .   .   END=15896771    GT:DP:GQ:MIN_DP:PL  0/0:6:0:0:0,0,0
chrY    15896772    .   T   A,<NON_REF> 0.01    .   .   GT  ./.
chrY    15896773    .   T   G,<NON_REF> 0.01    .   .   GT  ./.
chrY    15896774    .   T   <NON_REF>   .   .   END=15896774    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
chrY    15896775    .   T   G,<NON_REF> 0.01    .   .   GT  ./.
chrY    15896776    .   T   <NON_REF>   .   .   END=16027941    GT:DP:GQ:MIN_DP:PL  0/0:1:0:0:0,0,0
vladsavelyev commented 2 years ago

In most regions, DP should correlate really well with GQ

Shouldn't GQ also depend on MQ of reads? DP would include low-MQ reads:

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">

Resulting in a high DP; However GQ would remain small (as in the poly-T region).


We can do pretty reliable Y ploidy inference from exomes, so restricting to some high quality and/or well behaved intervals will probably still yield enough data

Thanks!!

I worked around this issue by removing regions with a very high coverage (10x times the normalization contig coverage):

mt = mt.filter_entries(mt.DP < 10 * mt.chr20_mean_dp)

And then also recalculating the denominator as the sum of sizes of all unfiltered regions (which however wouldn't work well with excluded_calling_intervals/included_calling_intervals):

f"{chrom}_blocks_total_size": hl.agg.sum(
    hl.cond(
        chr_mt.LGT.is_hom_ref(),
        1 + chr_mt.END - chr_mt.locus.position,
        1,                    
    )
)

It solves the problem. However, your suggestion to whitelist/blacklist regions will be a better approach, as it wouldn't require modifying the gnomad methods codebase. I might take default genome regions, and remove known repeats using a UCSC mask or something.

ldgauthier commented 2 years ago

The BWA-derived GVCF makes more sense to me. Yes, there are GQ0 blocks and calls, but the depth on the ref blocks is very low, which is what I would expect for chrY on a female.

We're seen a lot of discrepancies between DP in comparing DRAGEN with GATK (4.1 and onwards, I believe). Most of these seem to revolve around differences in how low mapping qualities are treated, so I think using regions of "high confidence" would be a robust way to move forward.

vladsavelyev commented 2 years ago

Thanks a lot for your help!

And really thanks for publishing the gnomad package, can't overestimate how useful it was for us so far