Open ghost opened 4 years ago
One line per base would lead to impractically large output files for GLnexus' main use cases. There is a proposal under discussion in GA4GH about standardizing a multi-sample GVCF format, which would summarize reference coverage in between variant sites. We are monitoring developments there but it will take some time yet to work its way through that process.
I was asking because it seems the GATK joint caller has an "all sites" option. I understand, however, that GLnexus places a strong emphasis on computational efficiency.
Yeah, it's not something GLnexus' main users have requested or would seem likely to use. If you were really dedicated, you could synthesize a GVCF exhibiting a fake variant with good quality metrics at every position and feed that in, causing GLnexus to generate a pVCF site for every position. (To be clear, I'm not recommending this -- I think it would work in principle, but there are always unforeseen problems.)
Okay. I raised it in case that option might be useful for people calculating mutation rates. We divide by the genome size, or more precisely by the number of "callable bases" in the genome, i.e., sites that are homozygous but that wouldn't have been filtered out if they hadn't been homozygous.
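To illustrate why the denominator matters, here is a minimal sketch of a per-site, per-generation mutation rate estimate. The function name, the diploid factor of 2, and all the numbers are illustrative assumptions, not values from any real dataset or from GLnexus.

```python
def per_site_mutation_rate(n_de_novo, n_bases, ploidy=2, generations=1):
    """Estimate mutations per base per generation.

    n_bases is the denominator: either the full genome size or, more
    appropriately, the number of callable bases (hypothetical inputs).
    """
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return n_de_novo / (ploidy * n_bases * generations)


# Made-up numbers, chosen only to show the effect of the denominator.
genome_size = 3_000_000_000
callable_bases = 2_500_000_000
n_de_novo = 60

rate_genome = per_site_mutation_rate(n_de_novo, genome_size)
rate_callable = per_site_mutation_rate(n_de_novo, callable_bases)

print(f"dividing by genome size:    {rate_genome:.2e}")
print(f"dividing by callable bases: {rate_callable:.2e}")
```

Dividing by the full genome size underestimates the rate whenever part of the genome could not have yielded a call, which is why a count of callable homozygous-reference sites is needed.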
Thanks -- happy to leave this ticket open for others to +1 or comment
This feature would be important for calculating some population genomic statistics that are sensitive to the total number of base pairs mapped, such as pi (nucleotide diversity).
See here: https://pixy.readthedocs.io/en/latest/generating_invar/generating_invar.html
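As a rough illustration of that sensitivity, the sketch below computes pi as total pairwise differences divided by total pairwise comparisons across sites. It is a simplified toy with made-up allele counts, not pixy's implementation; the function names are hypothetical.

```python
from math import comb


def site_diffs_and_comps(ref_count, alt_count):
    """Pairwise differences and pairwise comparisons at one biallelic site."""
    n = ref_count + alt_count
    return ref_count * alt_count, comb(n, 2)


def pi(sites):
    """Average pairwise differences per site over (ref_count, alt_count) pairs."""
    total_diffs = 0
    total_comps = 0
    for ref_count, alt_count in sites:
        diffs, comps = site_diffs_and_comps(ref_count, alt_count)
        total_diffs += diffs
        total_comps += comps
    return total_diffs / total_comps if total_comps else float("nan")


# Toy data: 8 sampled chromosomes, 2 variant sites and 8 invariant sites.
variant_sites = [(6, 2), (4, 4)]
invariant_sites = [(8, 0)] * 8

pi_all = pi(variant_sites + invariant_sites)
pi_variants_only = pi(variant_sites)

print(f"pi with invariant sites: {pi_all:.3f}")
print(f"pi from variants only:   {pi_variants_only:.3f}")
```

Without records for the invariant sites, the denominator only counts variant positions and pi is inflated, which is why an all-sites output is useful for this kind of statistic.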
Hello,
Is there any way to output the homozygous reference bases in the pVCF? Can I have a pVGCF, with one line per base in my reference genome?
thanks