cbg-ethz / shorah

Repo for the software suite ShoRAH (Short Reads Assembly into Haplotypes)
GNU General Public License v3.0
39 stars 14 forks source link

VCF output implemented #67

Closed NBMueller closed 4 years ago

NBMueller commented 4 years ago

Only tested in shotgun and snv mode, not with amplicon mode (but adjusted for it)

DrYak commented 4 years ago

Thanks for the feature.

somebody with better python experience to review it ?

kpj commented 4 years ago

The INFO header of the generated VCF seems incomplete.

An exemplary INFO line is:

Freq1=*;Freq2=1.0000;Freq3=1.0000;Post1=*;Post2=1.1271;Post3=1.0894;Fvar=1099;Rvar=2;Ftot=1100;Rtot=2;Pval=0.515509;Qval=1

However, the frequency associated INFO header is only:

##INFO=<ID=Frq<X>,Number=1,Type=Float,Description="Frequency of the variant in window <X>">

Do we not need one for Freq1, ...?

kpj commented 4 years ago

The format of Freq1 is set to Float, but its value can be * (which is not a float). We could use a different float to encode * (e.g. -1.0) or switch the datatype to String. The former might be preferable.

DrYak commented 4 years ago

@kpj is there a simple/elegant way to make a field optional in the format? alternatively have a "windows present" bolean for windows 1/2/3 ?

kpj commented 4 years ago

@DrYak: I am not entirely sure. We might be fine with setting Number=. for "If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’." (https://samtools.github.io/hts-specs/VCFv4.2.pdf). * could then be replaced by an empty value.

NBMueller commented 4 years ago

The Freq and Post fields can contain, besides the normal floats, a '-' for SNV not present in the window or '*' for not enough coverage to tell anything. Most sense, I assume, makes setting '-' to 0 and '*' to nan. But in/for which language? Whitespace is not an option as it is interpreted as 0? In general, I am not sure of how much of an issue this is, as its just the INFO column with arbitrary customer values.

kpj commented 4 years ago

This would certainly be an issue for VCF parsers which read in the INFO as a string and then try to convert it to float. I observed this problem with pyVCF.

We could just set the format to string and leave appropriate conversions to the user.

NBMueller commented 4 years ago

Adjusted Freq and Post to only contain floats: '-' is replaced by 0, for '*' the freq/post is excluded

DrYak commented 4 years ago

If no pending remarks, I would merge into master.