Closed holtjma closed 2 years ago
Just another note: if I run somalier directly off of the BAM files, I get the expected results downstream. So it seems to be a disconnect between the specific VCF format and somalier (i.e., I'm pretty sure this is not a sample issue).
Yeah, I think I should document that BAM/CRAM is always preferred. If that's not possible, then gVCF. If a multi-sample VCF is available, that will work great (for that cohort against itself). I tried to properly support cases where not enough info is available, but it's just too error-prone.
If it has lower depth at those sites, they could be excluded. But I just added a note to the readme about this.
Ah okay, if BAM is recommended then I should probably just go with that approach here.
As for the VCF, seems like it doesn't have an AD field, so I don't think it would exclude there?
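For anyone wanting to confirm the AD observation on their own file, here is a minimal sketch that lists the FORMAT IDs declared in a VCF header. The function name `format_ids` is made up for illustration, and it only inspects the header: a declared field can still be absent from individual records, and this says nothing about how somalier itself decides what to exclude.

```python
import gzip


def format_ids(vcf_path):
    """Return the set of FORMAT field IDs declared in a VCF header.

    Hypothetical helper for illustration only; reads plain or gzip/bgzip VCF.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    prefix = "##FORMAT=<ID="
    ids = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta lines end at the #CHROM column header
            if line.startswith(prefix):
                # "##FORMAT=<ID=AD,Number=R,..." -> "AD"
                ids.add(line[len(prefix):].split(",", 1)[0].rstrip(">\n"))
    return ids
```

Usage would look like `"AD" in format_ids("sample.vcf.gz")`, where the file name is a placeholder.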
Closing because of the BAM workaround; the issue with the VCF is unresolved but not particularly relevant anymore.
Hello,
I was testing using somalier on a new beta pipeline for long-reads and encountered a particular issue I can't explain. The data we're using is PacBio HiFi and we're currently using the Sentieon DNAScope for variant calling. This produces a VCF file (specifically, not gVCF) so I knew we would need to use --unknown for somalier. However, known relationships between GIAB samples weren't matching.
After looking into the logs, we're getting far fewer variants extracted than I would expect:
In a comparable VCF for the same sample, we get well over 10k extracted.
The sample is passing benchmarking with flying colors, and I did a quick bcftools isec between the VCF and the sites files and got 10850 variants matching, which is approximately what I would expect.
So long story short, I'm at a loss as to why somalier seems to be missing a bunch of variants. Happy to share one of the VCF files somehow (I think I have an email from previous issues somewhere...) if that's the best path forward.
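As a rough cross-check of the overlap count described above, here is a minimal sketch that counts sites shared between a sample VCF and a sites file using only the Python standard library. It assumes an exact match on (CHROM, POS, REF, ALT) with multi-allelic ALTs split per allele, and does no normalization. That is a simplification: it is not how bcftools isec or somalier match records, and differing chromosome naming ("chr1" vs "1") would give zero matches. The function names are hypothetical.

```python
import gzip


def iter_site_keys(vcf_path):
    """Yield (chrom, pos, ref, alt) for every ALT allele in a VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t", 5)[:5]
            for allele in alt.split(","):
                yield (chrom, int(pos), ref, allele)


def count_shared_sites(sample_vcf, sites_vcf):
    """Count distinct sites present in both files (exact-match assumption)."""
    # The sites file is small, so hold it in memory and stream the sample VCF.
    sites = set(iter_site_keys(sites_vcf))
    return len({key for key in iter_site_keys(sample_vcf) if key in sites})
```

If this count is close to what bcftools isec reports but far above what somalier extracts, that would point at filtering during extraction rather than at missing sites in the VCF.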