Closed MattWellie closed 1 month ago
Hey @MattWellie thanks for reporting this. I do suspect this has something to do with how the input data is formatted here, specifically the genotype (GT
) fields. In gatk-sv, genotypes are always diploid even on sex chromosomes (e.g. 0/1
for hemizygous genotypes and ./.
for females on Y). Admittedly this is not correct, but for historical reasons we still maintain this convention and the annotation assumes as much on the input.
I suspect that the data you're using is using the correct convention where GT
reflects the ploidy (i.e. 0
or 1
for haploid genotypes). Can you check the VCF for records that might use that convention? I'm not sure how they would get in there unless you did something a little off the rails.
Hi Mark, thanks for replying. I was missing the forest for the trees a bit here: The data we're having problems with is generated from GATK-gCNV, not from this pipeline. As we've implemented GATK-SV for genomes already, we wanted to share a common annotation endpoint for our exome and genome pipelines. That is the reason we're now hitting this AD counting issue when it was previously A-Ok.
If it doesn't cause you any issues we'd love for this to be patched at source, otherwise we can implement a buffer stage in our pipeline to translate all these haplo genotypes into the GATK'y representation before running through annotation
IMO, if you are mixing GATK-SV and GATK-gCNV calls it would be better to harmonize everything with one genotype convention, so a formatting script for gCNV->GATK-SV sex chromosome genotype conversion might be the best way to go. Unfortunately I don't think we have any scripts available for this purpose but it shouldn't be too hard to implement.
I'd definitely be open to fixing at the source if you want to create a PR for that, but I don't think we can dedicate any effort to that right now.
I think this PR itself is a minim fix. The round shouldn't affect your GATK sv genotypes as they're already ints. I've test run this script locally to show that it fixes my CNV case, I'll check it produces identical gatksv output
Ah I didn't even realize this was a PR! I think the fix is a bit more complicated, however. I think you would need to know a priori which convention is being used for GT and adjust the calculation for AN accordingly. You could modify the script to take in a flag that switches the behavior.
@MattWellie did you want to follow up on this or should we close this out? I think a proper fix will be a bit involved so if you have a workaround it would probably be easier for you to do that instead.
Hi @mwalker174,
Just to clarify from before we have two separate pipelines which use this GATK-SV annotation stage, but the results are never intermingled. The gCNV/exome usage is running into this problem, the GATK-SV/genomes are fine. I don't think it's important, but just wanted to clear that up.
I've overcome this issue by taking your sv-pipeline docker image and patching this script with a working copy that rounds the AN value before assigning it to the pysam AN
. That's solved this problem for our purposes, so if this is a patch you don't need I'm happy to close this.
Ok I think it makes sense to close this out but keep #663 open.
Docker image this failure was identified in: australia-southeast1-docker.pkg.dev/cpg-common/images/sv/sv-pipeline:2023-09-13-v0.28.3-beta-af8362e3
Relevant Workflow: ShardedAnnotateVcf.ComputeAFs https://github.com/broadinstitute/gatk-sv/blob/d37e453038e425acba9683da503218fe3d4b1033/wdl/ShardedAnnotateVcf.wdl#L146
I've recently seen deterministic failures during AnnotateVCF running on CNV data. I've tracked the error to this line -
if Hemi (chrX & MALE & non-par), AN = AN/2. If the AN was an odd number, the result is a float. This value is added to the Variant entity as
MALE_AN
, which is described in the header as an INTEGER. PySam doesn't support adding a float value to an Integer field:There is an additional later call implicated here,
adj_an = m_an + f_an
, where if the male AN is a float, the overallAN
field would become a float, though this single change also addresses that, asMALE_AN
is fixed as an Integer.I'm not sure if I'm running into this now due to some weird data, previous data run through the same pipeline has never hit this issue, but based on the data I have this is a possible point of failure and I'm surprised nobody has hit it yet.