Closed holtjma closed 2 years ago
hi, thanks for the clear report. --unknown/-u
is not ideal and just provides a best-effort implementation for when only single-sample VCFs are available. that said, i think it could be improved. i agree that is is surprising that the hom_alt counts change so much.
any change you could share the vcf files for those 3 samples?
I'll see what I can do. Assuming yes, same email as last time?
please and thanks. that would be great.
Sent, interested to know what you find out! (Hopefully, I didn't do something stupid haha)
just to follow up on this and summarize what Matt Holt and I discussed via email:
in order to allow comparisons between genome builds (which can change reference alleles), somalier sorts alphabetically so that the lowest allele is always first. Because of how this is handled and the inclusion of options that allow --unknown to be treated as reference and that some GVCFs do not indicate the alternate, this can be inconsistent for various combinations of options. This is (now) a known limitation.
Hello,
I ran into a strange difference in output between gVCF and VCF runs of the same samples and was hoping to get some input into how it happens.
Background: I've been using strelka2 which outputs both a gVCF and VCF file per sample (in my tests, the samples were run individually). I then run
somalier extract
to pull out the variants and then do asomalier relate
where the only difference is that I add a-u
for the VCF version to account for0/0
calls essentially being removed from the VCF.Issue: When I look at the outputs, I get drastically different numbers isolated to the
hom_alts
related fields, enough that for particular samples it can drastically skew relatedness. Here's an example of the file using gVCF inputs and no-u
in the relate (I've clipped it for brevity, first row are replicates, second row completely unrelated samples):And here's the values from the same
strelka2
runs, only using the VCFs and the-u
option during the relate step:There's a huge jump in relatedness (-0.004 -> 0.274) and the obvious culprit is the change the
hom
related fields. I'm planning to simply use the gVCF files going forward, but I wasn't able to really explain why the output was so drastically different given that these are from the exact same runs of the software. Thoughts on how to proceed?