Investigate doubleton results (missing duplicates)

ch-kr commented 2 years ago

Look into why finding doubleton pairs failed to find sample dups/twins

gtiao commented 2 years ago

Mark would like to also see the underlying distribution of # of singletons per person in the UKBB callset. He finds the # of parent/child and sibling pairs that were missed in your initial analysis unusual, because he expects ~100 singletons per person and that means you'd expect doubleton sharing between parent/child and siblings to be ~50 -- unless there's more expanded family structures present in the callset.

gtiao commented 2 years ago

So missing so many parent/child and sibling pairs based on a doubleton analysis is weird and unexpected.

ch-kr commented 2 years ago

Started looking at missed duplicates/twins. First thing checked: whether samples were part of a trio. Only one of the three pairs is in a trio:

jkgoodrich commented 2 years ago

Putting this as blocked for now, we will resume after analysis using King

ch-kr commented 2 years ago

after some discussion, we've decided to start working on this ticket as Leo works on KING/pc-relate comparison and as hail team builds new pc-relate

next step is to look at this: underlying distribution of # of singletons per person in the UKBB callset

ch-kr commented 2 years ago

Cluster config used for analyses on 5/18, 5/19 (forgot to add):

hailctl dataproc start kc --autoscaling-policy=max-20 --master-machine-type n1-standard-8 --worker-machine-type n1-standard-8 --project broad-mpg-gnomad --max-age=8h --packages gnomad --num-workers 2

Command:

hailctl dataproc submit kc doubletons.py --slack-channel "@kc (she/her)" --slack-token xxx --temp-path gs://gnomad-tmp/kc/

Job ID: ebc825b3d1dd43548506ebb7087fd723

Notebooks used for comparisons:

gs://gnomad-kc/notebooks/doubletons.ipynb
gs://gnomad-kc/notebooks/doubletons_debug.ipynb

mike-w-wilson commented 2 years ago

Cluster config for singleton analyses on 7/7:

hailctl dataproc start mw1 --autoscaling-policy=max-20 --master-machine-type n1-standard-8 --worker-machine-type n1-standard-8  --init gs://gnomad-mwilson/init.sh --max-idle=30m --requester-pays-allow-buckets gnomad-tmp,gnomad-mwlson,gnomad --packages gnomad

Command:

hailctl dataproc submit mw1 singleton_utils.py

Job ID: 38764f697bee47c1a2fbd882973553c9

This job will grab the n singletons per individual in UKBB. Next step is just plotting the distribution.

ch-kr commented 2 years ago

Will likely have follow-up items after gnomAD meeting; possible next steps could be to use tripletons rather than doubletons and to remove the biallelic SNP filter

mike-w-wilson commented 2 years ago

Dataset was filtered like the doubleton analysis: high quality autosomal sites and adj entries. Approx median: 29 Mean: 34.56 stdev: 24.78 Min: 0 Max: 411.0 Code: https://github.com/broadinstitute/tgg_methods/pull/47

mike-w-wilson commented 2 years ago

I removed the bi-allelic SNP filter from the script and reran the analyses. Approx median: 35 Mean: 40.56 StDev: 26.8 Min: 0 Max: 451

mike-w-wilson commented 2 years ago

Next step is rerunning KCs doubleton analyses with tripletons

mike-w-wilson commented 2 years ago

Slack thread for singleton distribution: https://atgu.slack.com/archives/CRA2TKTV0/p1657820227511769

gtiao commented 2 years ago

Mark asked for a few next plots:

Plot histogram of doubleton+tripleton sharing per pair in the callset
Also plot the distribution of the maximum number of doubletons+tripletons shared with another person in the callset, per person; this can then be plotted against the per-person singleton rate

See also the general comments on the plots above: https://atgu.slack.com/archives/CRA2TKTV0/p1657891688786999?thread_ts=1657820227.511769&cid=CRA2TKTV0

mike-w-wilson commented 2 years ago

I'm hitting an off-heap memory issue when trying to run the n non ref code. I've posted to zulip https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Hail.20off-heap.20memory.20exceeded.20maximum.20threshold/near/290902183

mike-w-wilson commented 2 years ago

There was a bug in the code which was causing hail to run out of memory. This is fixed and both samples non ref == 3 and samples non ref == 2 ran using the updated code. The cluster config was pretty meaty because I had jumped to highmem machines in an attempt to get around the OOM errors but it looks like everything was well utilized. hailctl dataproc start mw1 --autoscaling-policy=max-20 --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --init gs://gnomad-mwilson/init.sh --requester-pays-allow-buckets gnomad-tmp,gnomad-mwlson,gnomad --no-off-heap-memory --max-idle=30m

Jobs: n non ref ==2 https://console.cloud.google.com/dataproc/jobs/643ca21d749d4caeba145b4795fe90e9/monitoring?region=us-central1&authuser=0&project=broad-mpg-gnomad n non ref == 3 (two jobs because one failed for a silly argument I left in but after a checkpoint so I just restarted with _read_if_exists=True) https://console.cloud.google.com/dataproc/jobs/e87f87fc32e24724bc78457dd543e24d/monitoring?region=us-central1&authuser=0&project=broad-mpg-gnomad https://console.cloud.google.com/dataproc/jobs/b97ee208ba694b30ac3077011a84745b/monitoring?region=us-central1&authuser=0&project=broad-mpg-gnomad Next steps are to join the tables and make Mark's requested plots.

mike-w-wilson commented 2 years ago

Posted plots that Mark had asked for a couple week ago in the gnomad_qc channel: https://atgu.slack.com/archives/CRA2TKTV0/p1659379153354149?thread_ts=1657820227.511769&cid=CRA2TKTV0

mike-w-wilson commented 2 years ago

Mark would like to see:

First degree relateds, pc_relate >0.2, doub_trip distribution
doub_trip distribution by ancestry
If first degree relateds is shifted to the high end of the overall distribution, a correlation analysis between doub_trip count and pc relate & KING estimates of IBD sharing

mike-w-wilson commented 2 years ago

Plotted first two bullet points in the above comment here: https://atgu.slack.com/archives/CRA2TKTV0/p1660597546290599?thread_ts=1657820227.511769&cid=CRA2TKTV0 . Waiting to hear Mark's thoughts on the shift in the distribution for relateds and also asked what plots he would like to see in the third point.

mike-w-wilson commented 2 years ago

Pinged Mark on this again. We will very likely need to run KING on UKB. KC mentioned theres a MT from the recalculated freqs hanging around containing EUR samples. I could use that to filter to qc sites and then Leo's repo is fairly well documented for prepping a MT and running cuKING.

ch-kr commented 2 years ago

related to #552, so closing this ticket as well

ch-kr commented 1 year ago

thank you for flagging this!

broadinstitute / gnomad_methods

Investigate doubleton results (missing duplicates) #460