SACGF / variantgrid

VariantGrid public repo
Other
23 stars 2 forks source link

Somalier relateness investigation #183

Open davmlaw opened 3 years ago

davmlaw commented 3 years ago

This sample is related to way too many other samples (like 1200 of them over >.1):

https://variantgrid.com/snpdb/view_sample/1824

If you look at the overlaps in an analysis https://variantgrid.com/analysis/2251/ they look legit - eg 802/1766 HOM ALT in common

Sample  relatedness ibs0    ibs2    hom_concordance hets_a  hets_b  hets_ab shared_hets hom_alts_a  hom_alts_b  shared_hom_alts N   x_ibs0  x_ibs2
GAI013 (170803HamishScott_WGS_AGRF.gatk.exome_only.vcf.gz)  0.282   513 5122    0.18    6364    6392    10335   2485    5479    3125    1588    11000   65  111
sksmi commented 3 years ago

Posttransplant saliva - think we can probably assume it messes with the algorithm. Also, think that patient/family has been sequenced a tonne which probably complicates things.

davmlaw commented 3 years ago

Somalier switched algoritm to fix their issue 55 ( Somalier underestimates relatedness if low sites overlap #55)

proc rel(r:relation): float64 {.inline.} =
  return 2 * (r.shared_hets.float64 - 2 * r.ibs0.float64) / r.het_ab.float64

Somalier paper gives relatedness between sample i and sample j as:

(shared-hets(i,j)-2∗ibs0(i,j))/min (hets(i),hets(j))

where hets is the count of heterozygote calls per sample out of the assayed sites.

... unrelated individuals will have a concordance of around 0


davmlaw commented 3 years ago

Should also ensure we don't run this on tumor normal subtraction

davmlaw commented 3 years ago

Not sure if it's the cause of the error, but we are writing "-1" instead of "." for missing data in AD/DP

davmlaw commented 3 years ago

Fixed "-1" issue - re-running on vgaws in screen:

python3.8 manage.py somalier_existing_vcfs --clear
davmlaw commented 3 years ago

Some failed with:

    for i, (z, ad, dp, af) in enumerate(zip(samples_zygosity, allele_depth, read_depth, allele_frequency)):
TypeError: 'NoneType' object is not iterable

Might be worth hangling those cases. Could make somalier_existing_vcfs check for failed extractions

davmlaw commented 3 years ago

After re-run almost everything was exactly the same.

I then tried running it directly on a VCF, and checked the results vs after VCF produced from DB dump and they were very similar (slight diffs at 2 sig figs due to reference AD calcs)

I then looked at Somalier command line and it reccomended using --unknown to set unknown genotypes to hom-ref - often preferable to use this with VCF samples that were not jointly called

So now I count sample VCFs and if >1 use --unknown - tried running on all samples and it produced about 1/5 the number of records

Before: SomalierRelatePairs had 38,802 records (above our min cutoff) After: 2,149 pairs

davmlaw commented 3 years ago

Testing:

https://variantgrid.com/snpdb/view_sample/1824 - only has 1 related - 0.82 which is sample sample name in different VCF (FreeBayes)

However a known related group HSS2008 and family didn't have matches.

But if you look at the data:

/mnt/variantgrid/media_root/somalier/related/0bd17500-94fd-48c8-a7c9-60e8fdbe49c6/somalier.pairs.tsv

hss2008_4   hss2009_5   0.620   498 12544   -0.221  8400    8280    16680   6169    1758    1664    629 17384   0   90  -1.0
hss2008_4   hss2010_6   0.601   507 12409   -0.238  8400    7892    16292   5912    1758    1626    627 17384   0   138 -1.0

They were not shown as they had shared hom alts of 627/629 while we had minimum cutoff of 1000

Altered cutoff to 200... (I just made up 1000 before) - now there are 2614 (only a 20% increase)

TODO:

davmlaw commented 3 years ago

HSS2008 is related to both parents with 0.508/0.513 but with unknown its 0.336/0.324

Proabably the best thing to do is load it from the VCF when its shared and use --unknown when its not, but this will understate the relationship when its from different VCFs. I guess we can just document that....

It still detects identical samples ok