Any recommended parser?

dnanexus-rnd / GLnexus

Scalable gVCF merging and joint variant calling for population sequencing projects

Apache License 2.0

145 stars 37 forks source link

Any recommended parser? #209

Closed ghost closed 4 years ago

ghost commented 4 years ago

Hello,

I was just wondering if there was any recommended tool for parsing the pVCF? Of course it is always possible to write on to extract the relevant fields. If I am interested into de novo variants that exist in 1 sample, and is not shared with any other sample, what would be the genotype assigned by GLnexus? Am I right to assume I should look for rows where the genotypes is ./. everywhere but in a single sample?

thanks a lot

mlin commented 4 years ago

Hi,

Any VCF parser should be able to load the GLnexus pVCF at least syntactically. Let us know if you find this not to be the case. If you specify the programming language maybe I can provide a general suggestion from past experience.

I say "at least syntactically" because there are a lot of subtleties in how VCF details can be interpreted, especially when there are information dependencies between different records. This is a general challenge for the field.

To find singletons I think you would look for rows where the GT fields call an alternate allele exactly once across the samples (e.g GT=0/1 or GT=0/2 or GT=./1 or GT=./2). Whether the other samples have GT=0/0 or GT=./. is less relevant.

Keep in mind this will find some bona fide singletons but, given the noise inherent in these experiments, will also tend to enrich for false-positive or otherwise problematic variant calls.

ghost commented 4 years ago

Thanks, in fact, since I am only interested in Singletons, is the representation from GLnexus the best suited? Compared with, let's say, a "simple" bcftools intersect.

If I understood you well

the T:C is called with 0/1 but as it also has a sample where it is 1/1 it should be rejected as a singleton

Whereas the monoallelic should be kept since it has ./1 and ./. everywhere else.

So the other values can only be ./. or ./0 or 0/0 right? (I am not aware of a 0/. GT) I think it's important to know exactly what the other samples might have, for parsing purposes. Like, in pseudo-code assuming 2 samples

if ($GT1 = 0/1 or 0/2 or ./1 or ./2) (AND $GT2 = 0/0 or ./. or ./0) (note sur for the ./0).

As for programming languages, Perl, Python and AWK are okay, as well as R I guess

ghost commented 4 years ago

I tried with the package vcfR, I talk about it here (in case you are interested): https://github.com/knausb/vcfR/issues/152