brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

gVCF v. VCF outputs #39

Closed holtjma closed 2 years ago

holtjma commented 4 years ago

Hello,

I ran into a strange difference in output between gVCF and VCF runs of the same samples and was hoping to get some input into how it happens.

Background: I've been using strelka2 which outputs both a gVCF and VCF file per sample (in my tests, the samples were run individually). I then run somalier extract to pull out the variants and then do a somalier relate where the only difference is that I add a -u for the VCF version to account for 0/0 calls essentially being removed from the VCF.

Issue: When I look at the outputs, I get drastically different numbers isolated to the hom_alts related fields, enough that for particular samples it can drastically skew relatedness. Here's an example of the file using gVCF inputs and no -u in the relate (I've clipped it for brevity, first row are replicates, second row completely unrelated samples):

#sample_a   sample_b    relatedness hom_concordance hets_a  hets_b  shared_hets hom_alts_a  hom_alts_b  shared_hom_altibs0  ibs2    n   x_ibs0  x_ibs2  expected_relatedness
SL362490    SL362491    0.999   0.999   6889    6885    6880    5679    5682    5676    0   17328   17328   0   359 -1.0
SL362490    SL409548    -0.004  -0.026  6889    6977    2929    5679    5694    2807    1477    7845    17316   66  162 -1.0

And here's the values from the same strelka2 runs, only using the VCFs and the -u option during the relate step:

#sample_a   sample_b    relatedness hom_concordance hets_a  hets_b  shared_hets hom_alts_a  hom_alts_b  shared_hom_altibs0  ibs2    n   x_ibs0  x_ibs2  expected_relatedness
SL362490    SL362491    0.999   0.997   6889    6885    6880    1315    1317    1315    2   17368   17384   0   197 -1.0
SL362490    SL409548    0.274   -0.381  6889    6977    2929    1315    1363    541 521 8855    17384   0   42  -1.0

There's a huge jump in relatedness (-0.004 -> 0.274) and the obvious culprit is the change the hom related fields. I'm planning to simply use the gVCF files going forward, but I wasn't able to really explain why the output was so drastically different given that these are from the exact same runs of the software. Thoughts on how to proceed?

brentp commented 4 years ago

hi, thanks for the clear report. --unknown/-u is not ideal and just provides a best-effort implementation for when only single-sample VCFs are available. that said, i think it could be improved. i agree that is is surprising that the hom_alt counts change so much.

any change you could share the vcf files for those 3 samples?

holtjma commented 4 years ago

I'll see what I can do. Assuming yes, same email as last time?

brentp commented 4 years ago

please and thanks. that would be great.

holtjma commented 4 years ago

Sent, interested to know what you find out! (Hopefully, I didn't do something stupid haha)

brentp commented 4 years ago

just to follow up on this and summarize what Matt Holt and I discussed via email:

in order to allow comparisons between genome builds (which can change reference alleles), somalier sorts alphabetically so that the lowest allele is always first. Because of how this is handled and the inclusion of options that allow --unknown to be treated as reference and that some GVCFs do not indicate the alternate, this can be inconsistent for various combinations of options. This is (now) a known limitation.