Open davmlaw opened 3 years ago
Posttransplant saliva - think we can probably assume it messes with the algorithm. Also, think that patient/family has been sequenced a tonne which probably complicates things.
Somalier switched algoritm to fix their issue 55 ( Somalier underestimates relatedness if low sites overlap #55)
proc rel(r:relation): float64 {.inline.} =
return 2 * (r.shared_hets.float64 - 2 * r.ibs0.float64) / r.het_ab.float64
Somalier paper gives relatedness between sample i and sample j as:
(shared-hets(i,j)-2∗ibs0(i,j))/min (hets(i),hets(j))
where hets is the count of heterozygote calls per sample out of the assayed sites.
... unrelated individuals will have a concordance of around 0
Should also ensure we don't run this on tumor normal subtraction
Not sure if it's the cause of the error, but we are writing "-1" instead of "." for missing data in AD/DP
Fixed "-1" issue - re-running on vgaws in screen:
python3.8 manage.py somalier_existing_vcfs --clear
Some failed with:
for i, (z, ad, dp, af) in enumerate(zip(samples_zygosity, allele_depth, read_depth, allele_frequency)):
TypeError: 'NoneType' object is not iterable
Might be worth hangling those cases. Could make somalier_existing_vcfs check for failed extractions
After re-run almost everything was exactly the same.
I then tried running it directly on a VCF, and checked the results vs after VCF produced from DB dump and they were very similar (slight diffs at 2 sig figs due to reference AD calcs)
I then looked at Somalier command line and it reccomended using --unknown to set unknown genotypes to hom-ref - often preferable to use this with VCF samples that were not jointly called
So now I count sample VCFs and if >1 use --unknown - tried running on all samples and it produced about 1/5 the number of records
Before: SomalierRelatePairs had 38,802 records (above our min cutoff) After: 2,149 pairs
Testing:
https://variantgrid.com/snpdb/view_sample/1824 - only has 1 related - 0.82 which is sample sample name in different VCF (FreeBayes)
However a known related group HSS2008 and family didn't have matches.
But if you look at the data:
/mnt/variantgrid/media_root/somalier/related/0bd17500-94fd-48c8-a7c9-60e8fdbe49c6/somalier.pairs.tsv
hss2008_4 hss2009_5 0.620 498 12544 -0.221 8400 8280 16680 6169 1758 1664 629 17384 0 90 -1.0
hss2008_4 hss2010_6 0.601 507 12409 -0.238 8400 7892 16292 5912 1758 1626 627 17384 0 138 -1.0
They were not shown as they had shared hom alts of 627/629 while we had minimum cutoff of 1000
Altered cutoff to 200... (I just made up 1000 before) - now there are 2614 (only a 20% increase)
TODO:
HSS2008 is related to both parents with 0.508/0.513 but with unknown its 0.336/0.324
Proabably the best thing to do is load it from the VCF when its shared and use --unknown when its not, but this will understate the relationship when its from different VCFs. I guess we can just document that....
It still detects identical samples ok
This sample is related to way too many other samples (like 1200 of them over >.1):
https://variantgrid.com/snpdb/view_sample/1824
If you look at the overlaps in an analysis https://variantgrid.com/analysis/2251/ they look legit - eg 802/1766 HOM ALT in common