Closed NBMueller closed 4 years ago
Thanks for the feature.
somebody with better python experience to review it ?
The INFO header of the generated VCF seems incomplete.
An exemplary INFO line is:
Freq1=*;Freq2=1.0000;Freq3=1.0000;Post1=*;Post2=1.1271;Post3=1.0894;Fvar=1099;Rvar=2;Ftot=1100;Rtot=2;Pval=0.515509;Qval=1
However, the frequency associated INFO header is only:
##INFO=<ID=Frq<X>,Number=1,Type=Float,Description="Frequency of the variant in window <X>">
Do we not need one for Freq1
, ...?
The format of Freq1
is set to Float
, but its value can be *
(which is not a float).
We could use a different float to encode *
(e.g. -1.0
) or switch the datatype to String.
The former might be preferable.
@kpj is there a simple/elegant way to make a field optional in the format? alternatively have a "windows present" bolean for windows 1/2/3 ?
@DrYak: I am not entirely sure. We might be fine with setting Number=.
for "If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’." (https://samtools.github.io/hts-specs/VCFv4.2.pdf).
*
could then be replaced by an empty value.
The Freq and Post fields can contain, besides the normal floats, a '-' for SNV not present in the window or '*' for not enough coverage to tell anything. Most sense, I assume, makes setting '-' to 0 and '*' to nan. But in/for which language? Whitespace is not an option as it is interpreted as 0? In general, I am not sure of how much of an issue this is, as its just the INFO column with arbitrary customer values.
This would certainly be an issue for VCF parsers which read in the INFO as a string
and then try to convert it to float
. I observed this problem with pyVCF
.
We could just set the format to string and leave appropriate conversions to the user.
Adjusted Freq and Post to only contain floats:
'-' is replaced by 0, for '*' the freq
If no pending remarks, I would merge into master.
Only tested in shotgun and snv mode, not with amplicon mode (but adjusted for it)