brentp / cyvcf2

cython + htslib == fast VCF and BCF processing
MIT License

Dealing with missing values in INFO column #279

Closed: gnxsf closed this issue 1 year ago

gnxsf commented 1 year ago

Often, GATK cannot calculate a particular annotation value for every variant site. When this happens, GATK uses a 'dot' to represent the missing value (e.g. BaseQRankSum=.), and cyvcf2 raises a KeyError (see the error output below). Is there a way to discard any sites that have this missing value so that the numeric values of the other variant sites can still be extracted?

The macro formula I'm using is:

@continuous
def BaseQRankSum(variant):
    return variant.INFO["BaseQRankSum"]

The error output is:

Traceback (most recent call last):
  File "/usr/local/bin/vcfstats", line 6, in <module>
    sys.exit(main())
  File "/vcfstats/vcfstats/cli.py", line 192, in main
    instance.iterate(variant)
  File "/vcfstats/vcfstats/instance.py", line 181, in iterate
    self.formula.run(variant, self.data.append, self.data.extend)
  File "/vcfstats/vcfstats/formula.py", line 319, in run
    self.Y.run(variant, self.passed),
  File "/vcfstats/vcfstats/formula.py", line 131, in run
    value = self.term["func"](variant)
  File "/mnt/info_macros.py", line 30, in BaseQRankSum
    return variant.INFO["BaseQRankSum"]
  File "cyvcf2/cyvcf2.pyx", line 2174, in cyvcf2.cyvcf2.INFO.__getitem__
KeyError: b'BaseQRankSum'

brentp commented 1 year ago

You can use variant.INFO.get("BaseQRankSum"), which returns None if the key is not present. Is this what you need?
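As a rough illustration of that suggestion, the sketch below swaps the indexing lookup for INFO.get and skips sites where the result is None. The helper names base_q_rank_sum and collect_values and the demo records are hypothetical, and a plain dict stands in for the cyvcf2 INFO object so the snippet runs without a VCF file. How vcfstats itself treats a None returned from a macro is not covered here, so that part would need checking.

```python
from types import SimpleNamespace


def base_q_rank_sum(variant):
    """Return BaseQRankSum, or None when the annotation is absent."""
    # INFO.get returns None for a missing key instead of raising KeyError.
    return variant.INFO.get("BaseQRankSum")


def collect_values(variants, getter):
    """Keep only the sites where the getter produced a value."""
    values = []
    for variant in variants:
        value = getter(variant)
        if value is None:  # annotation missing at this site, so skip it
            continue
        values.append(value)
    return values


# Stand-in records for illustration only (assumption): a plain dict mimics
# INFO.get here. With cyvcf2 the records would come from iterating a VCF.
demo_variants = [
    SimpleNamespace(INFO={"BaseQRankSum": 1.25, "DP": 30}),
    SimpleNamespace(INFO={"DP": 12}),  # no BaseQRankSum at this site
    SimpleNamespace(INFO={"BaseQRankSum": -0.5, "DP": 41}),
]

print(collect_values(demo_variants, base_q_rank_sum))
```

The check is written as "is None" rather than a truthiness test so that a legitimate value of 0.0 is not discarded along with the missing sites.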

gnxsf commented 1 year ago

Yes, this is perfect. Thank you for such a quick response!