Closed ghost closed 4 years ago
Hi,
Any VCF parser should be able to load the GLnexus pVCF at least syntactically. Let us know if you find this not to be the case. If you specify the programming language maybe I can provide a general suggestion from past experience.
I say "at least syntactically" because there are a lot of subtleties in how VCF details can be interpreted, especially when there are information dependencies between different records. This is a general challenge for the field.
To find singletons I think you would look for rows where the GT fields call an alternate allele exactly once across the samples (e.g GT=0/1 or GT=0/2 or GT=./1 or GT=./2). Whether the other samples have GT=0/0 or GT=./. is less relevant.
Keep in mind this will find some bona fide singletons but, given the noise inherent in these experiments, will also tend to enrich for false-positive or otherwise problematic variant calls.
Thanks, in fact, since I am only interested in Singletons, is the representation from GLnexus the best suited? Compared with, let's say, a "simple" bcftools intersect.
If I understood you well
the T:C is called with 0/1 but as it also has a sample where it is 1/1 it should be rejected as a singleton
Whereas the monoallelic should be kept since it has ./1 and ./. everywhere else.
So the other values can only be ./. or ./0 or 0/0 right? (I am not aware of a 0/. GT) I think it's important to know exactly what the other samples might have, for parsing purposes. Like, in pseudo-code assuming 2 samples
if ($GT1 = 0/1 or 0/2 or ./1 or ./2) (AND $GT2 = 0/0 or ./. or ./0) (note sur for the ./0).
As for programming languages, Perl, Python and AWK are okay, as well as R I guess
I tried with the package vcfR, I talk about it here (in case you are interested): https://github.com/knausb/vcfR/issues/152
Hello,
I was just wondering if there was any recommended tool for parsing the pVCF? Of course it is always possible to write on to extract the relevant fields. If I am interested into de novo variants that exist in 1 sample, and is not shared with any other sample, what would be the genotype assigned by GLnexus? Am I right to assume I should look for rows where the genotypes is ./. everywhere but in a single sample?
thanks a lot