Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

writeVcf Output Causes IGV Errors from htsjdk #65

Open DarioS opened 1 year ago

DarioS commented 1 year ago

I created a subset file with seven samples out of two thousand and saved it to disk. IGV won't open it.

htsjdk.tribble.TribbleException: The provided VCF file is malformed at approximately line number 3428: empty alleles are not permitted in VCF records, for input source

Using head and tail, the variant looks like

$ zcat example.vcf.bgz | head -n 3428 | tail -n 1
chr1    10150   chr1:10150:CTA:CA       CTA     CA,,TTA 1035.73 PASS

The original file does open successfully in IGV and looks like ($ zgrep 10150 example.vcf.gz | head -n 1)

chr1    10150   chr1:10150:CTA:CA       CTA     CA,*,TTA        1035.73 PASS

Should the package preserve the asterisks as they are and not blank them like it currently does?

hpages commented 1 year ago

A deeper problem with the way VariantAnnotation handles the ALT field is that it replaces both the dots (.) and asterisks (*) with an empty string at load time, making it impossible to distinguish between a "no-alleles" situation and a deletion situation. For example, with a VCF file that contains the following lines:

#CHROM  POS     ID      REF     ALT     ...
20      17330   .       T       A       ...
20      20500   .       C       *       ...
20      1110696 .       A       G,T     ...
20      1230237 .       T       .       ...
20      1234567 .       GTC     G,*,GTCT        ...

ALT <- alt(readVcf(...)) will look like this (after turning it into a list of ordinary character vectors with as.list(as(ALT, "CharacterList"))):

[[1]]
[1] "A"

[[2]]
[1] ""

[[3]]
[1] "G" "T"

[[4]]
[1] ""

[[5]]
[1] "G"    ""     "GTCT"

Unfortunately ALT[[2]] and ALT[[4]] are both represented by an empty string so the information of whether these are a "no alleles" or a deletion is lost :disappointed:

Should the package preserve the asterisks as they are and not blank them like it currently does?

It could, but that's not necessarily the easiest way to go.

I suggest that we don't change how asterisks (i.e. deletions) get loaded i.e. they still get loaded as empty strings but these empty strings are replaced back with asterisks at write time. However, that means that we need to change the way we handle the dots (i.e. no alleles). One possibility is to preserve them. Another possibility is to represent the "no-alleles" situation with an empty (zero-length) DNAStringSet object or character vector (not the same as an empty string!). So for the above example, ALT would look like this:

> ALT
DNAStringSetList of length 5
[[1]] A
[[2]] 
[[3]] G T
[[4]] DNAStringSet object of length 0
[[5]] G  GTCT

> as.list(as(ALT, "CharacterList"))
[[1]]
[1] "A"

[[2]]
[1] ""

[[3]]
[1] "G" "T"

[[4]]
character(0)

[[5]]
[1] "G"    ""     "GTCT"

Then those zero-length DNAStringSet objects or character vectors would be replaced back with dots at write time.

IMO using a zero-length DNAStringSet object or character vector makes more sense than preserving the dots at load time. If there are no alleles, then the DNAStringSet object or character vector that represents them is empty. It's very natural and keeps things consistent because then the number of alleles is always equal to the length of the DNAStringSet object or character vector that is used to represent them.

H.

hpages commented 1 year ago

@DarioS, @lawremi, @mtmorgan, @vjcitn: Anybody wants to try to come up with a fix? I'm not a co-author of VariantAnnotation and I don't maintain it either. Was just sharing my analysis of the problem above, with suggestions on how to address.

DarioS commented 1 year ago

That seems like a suitable solution at first glance. I could implement it, unless someone else thinks that there is a better way.

vjcitn commented 1 year ago

I'd suggest that @DarioS take a shot at it, it sounds like a clear plan to me. When the PR is available I will review it and push.

DarioS commented 1 year ago

The code which changes asterisks into empty strings is easy to find. It is flat[grepl("*", flat, fixed = TRUE)] <- "" in function .formatALT. But the dots seem to be changed into empty strings in C code, which I have not worked with since I was an undergraduate student. The returned result of invocation of C function scan_vcf_character has the conversion made but I can't see in which statement that would happen. Can anyone experienced in C programming identify it?