broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License
160 stars 71 forks source link

Integer-conformant AF values #661

Closed MattWellie closed 1 month ago

MattWellie commented 3 months ago

Docker image this failure was identified in: australia-southeast1-docker.pkg.dev/cpg-common/images/sv/sv-pipeline:2023-09-13-v0.28.3-beta-af8362e3

Relevant Workflow: ShardedAnnotateVcf.ComputeAFs https://github.com/broadinstitute/gatk-sv/blob/d37e453038e425acba9683da503218fe3d4b1033/wdl/ShardedAnnotateVcf.wdl#L146

I've recently seen deterministic failures during AnnotateVCF running on CNV data. I've tracked the error to this line -

if hemi:
    AN = AN / 2

if Hemi (chrX & MALE & non-par), AN = AN/2. If the AN was an odd number, the result is a float. This value is added to the Variant entity as MALE_AN, which is described in the header as an INTEGER. PySam doesn't support adding a float value to an Integer field:

Traceback (most recent call last):
  File "/data/script.py", line 506, in <module>
    main()
  File "/data/script.py", line 498, in main
    newrec = gather_allele_freqs(r, samples_list, males_set, females_set, parbt, pop_dict,
  File "/data/script.py", line 105, in gather_allele_freqs
    calc_allele_freq(record, males_set, prefix='MALE', hemi=True)
  File "/data/script.py", line 195, in calc_allele_freq
    record.info[(prefix + '_' if prefix else '') + 'AN'] = AN
  File "pysam/libcbcf.pyx", line 2586, in pysam.libcbcf.VariantRecordInfo.__setitem__
  File "pysam/libcbcf.pyx", line 687, in pysam.libcbcf.bcf_info_set_value
  File "pysam/libcbcf.pyx", line 592, in pysam.libcbcf.bcf_check_values
TypeError: invalid value for Integer format

There is an additional later call implicated here, adj_an = m_an + f_an, where if the male AN is a float, the overall AN field would become a float, though this single change also addresses that, as MALE_AN is fixed as an Integer.

I'm not sure if I'm running into this now due to some weird data, previous data run through the same pipeline has never hit this issue, but based on the data I have this is a possible point of failure and I'm surprised nobody has hit it yet.

mwalker174 commented 3 months ago

Hey @MattWellie thanks for reporting this. I do suspect this has something to do with how the input data is formatted here, specifically the genotype (GT) fields. In gatk-sv, genotypes are always diploid even on sex chromosomes (e.g. 0/1 for hemizygous genotypes and ./. for females on Y). Admittedly this is not correct, but for historical reasons we still maintain this convention and the annotation assumes as much on the input.

I suspect that the data you're using is using the correct convention where GT reflects the ploidy (i.e. 0 or 1 for haploid genotypes). Can you check the VCF for records that might use that convention? I'm not sure how they would get in there unless you did something a little off the rails.

MattWellie commented 3 months ago

Hi Mark, thanks for replying. I was missing the forest for the trees a bit here: The data we're having problems with is generated from GATK-gCNV, not from this pipeline. As we've implemented GATK-SV for genomes already, we wanted to share a common annotation endpoint for our exome and genome pipelines. That is the reason we're now hitting this AD counting issue when it was previously A-Ok.

If it doesn't cause you any issues we'd love for this to be patched at source, otherwise we can implement a buffer stage in our pipeline to translate all these haplo genotypes into the GATK'y representation before running through annotation

mwalker174 commented 3 months ago

IMO, if you are mixing GATK-SV and GATK-gCNV calls it would be better to harmonize everything with one genotype convention, so a formatting script for gCNV->GATK-SV sex chromosome genotype conversion might be the best way to go. Unfortunately I don't think we have any scripts available for this purpose but it shouldn't be too hard to implement.

I'd definitely be open to fixing at the source if you want to create a PR for that, but I don't think we can dedicate any effort to that right now.

MattWellie commented 3 months ago

I think this PR itself is a minim fix. The round shouldn't affect your GATK sv genotypes as they're already ints. I've test run this script locally to show that it fixes my CNV case, I'll check it produces identical gatksv output

mwalker174 commented 3 months ago

Ah I didn't even realize this was a PR! I think the fix is a bit more complicated, however. I think you would need to know a priori which convention is being used for GT and adjust the calculation for AN accordingly. You could modify the script to take in a flag that switches the behavior.

mwalker174 commented 2 months ago

@MattWellie did you want to follow up on this or should we close this out? I think a proper fix will be a bit involved so if you have a workaround it would probably be easier for you to do that instead.

MattWellie commented 1 month ago

Hi @mwalker174,

Just to clarify from before we have two separate pipelines which use this GATK-SV annotation stage, but the results are never intermingled. The gCNV/exome usage is running into this problem, the GATK-SV/genomes are fine. I don't think it's important, but just wanted to clear that up.

I've overcome this issue by taking your sv-pipeline docker image and patching this script with a working copy that rounds the AN value before assigning it to the pysam AN. That's solved this problem for our purposes, so if this is a patch you don't need I'm happy to close this.

mwalker174 commented 1 month ago

Ok I think it makes sense to close this out but keep #663 open.