Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

Presentation of missing ALT alleles #20

Closed vobencha closed 5 years ago

vobencha commented 5 years ago

Opening this issue in response to question in https://github.com/Bioconductor/VariantAnnotation/issues/8#issuecomment-392411301.

vobencha commented 5 years ago

@d-cameron , Your original question:

"Why does an ALT missing allele (.) return an empty string and not NA_character when the VCF file contains variants with symbolic alleles? The empty string looks very much like a 1bp deletion that doesn't quite conform to the VCF specifications. I would have thought it makes more sense to return NA in this situation."

I've had to go back to the man pages and code to remember what was done. On the readVcf man page we've made a distinction between * and I:

      REF is returned as a DNAStringSet. ALT will be a character vector for
      structural variants and a DNAStringSetList otherwise. '*' characters
      are treated as missings and become an empty character in a
      DNAStringSetList. 'I' characters are treated as undefined and become
      a '.' in a DNAStringSetList.

And according to the VCF 4.2 spec, section 1.4.1, it doesn't look like dot is an option in the ALT field, only A, C, G, T, N, *.

Can you explain more about where you're seeing this problem, maybe with specific variants or an older VCF format version?

d-cameron commented 5 years ago

Can you explain more about where you're seeing this problem, maybe with specific variants or an older VCF format version?

I personally am not seeing it, but I've become a regular contributor to the VCF specifications themselves so am aware of the issues that get raised and the changes that are in the works. In this particular case, . is a valid value for the ALT field, but the specs themselves are not worded particular clearly.

VCFv4.2 section 1.4.1 intro text states:

There are 8 fixed fields per record. All data lines are tab-delimited. In all cases, missing values are specified with a dot (‘.’).

Further down in the 1.4.1.6 we have:

If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant).

Edit: clarified that . is not actually an ALT allele

d-cameron commented 5 years ago

Note that technically . not an ALT allele itself, but it indicates that the ALT field is missing any value (e.g. unable to call due to insufficient coverage). This means that a missing value . must appear by itself in the ALT field and values such as A,C,. are not valid since it's the entire field the missing value applies to, not a single ALT allele (this is covered by the <*> ALT allele).

vobencha commented 5 years ago

Thanks @d-cameron . I think the choice of representation is partially due to NAs not being supported in DNAStringSetList but here is some additional background.

The last conversation I remember about this was in 2015 with Michael @lawremi and Vince @vjcitn . IIRC, we wanted to distinguish between ALTs that were missing vs undefined, specifically these cases

The conversation resulted in this commit:

commit c6873c0dd5c63daec0402c73fc4934abdc527499
Author: Valerie Obenchain <vobencha@fhcrc.org>
Date:   Tue Sep 22 17:52:43 2015 +0000

    vcf parsing:

    - '*' handled as missing ("")
    - 'I' handled as undefined (".")

So it was intentional that '.' and '*' are represented differently from 'I'. I think an empty character was chosen instead of NA for the missing values so they could be represented in a DNAStringSet, i.e.,

> DNAStringSetList(NA)
Error in width(x) : NAs in 'x' are not supported

I don't see the 'I' character mentioned as a valid ALT value in the vcf 4.2 specs ... possibly it was output from a (user's) variant caller being used at the time this change was made.

vobencha commented 5 years ago

@d-cameron Any more comments on this or can we close this out?

d-cameron commented 5 years ago

I think an empty character was chosen instead of NA for the missing values so they could be represented in a DNAStringSet

That does really constrain your choices. Given empty shouldn't happen it does seem a reasonable choice as a NA proxy. The alternative of having different behaviour based on whether an SV is present seems like a terrible idea. Time to close this one out.