Closed ch-kr closed 2 years ago
Mark would like to also see the underlying distribution of # of singletons per person in the UKBB callset. He finds the # of parent/child and sibling pairs that were missed in your initial analysis unusual, because he expects ~100 singletons per person and that means you'd expect doubleton sharing between parent/child and siblings to be ~50 -- unless there's more expanded family structures present in the callset.
So missing so many parent/child and sibling pairs based on a doubleton analysis is weird and unexpected.
Started looking at missed duplicates/twins. First thing checked: whether samples were part of a trio. Only one of the three pairs is in a trio:
Putting this as blocked for now, we will resume after analysis using King
after some discussion, we've decided to start working on this ticket as Leo works on KING/pc-relate comparison and as hail team builds new pc-relate
next step is to look at this: underlying distribution of # of singletons per person in the UKBB callset
Cluster config used for analyses on 5/18, 5/19 (forgot to add):
hailctl dataproc start kc --autoscaling-policy=max-20 --master-machine-type n1-standard-8 --worker-machine-type n1-standard-8 --project broad-mpg-gnomad --max-age=8h --packages gnomad --num-workers 2
Command:
hailctl dataproc submit kc doubletons.py --slack-channel "@kc (she/her)" --slack-token xxx --temp-path gs://gnomad-tmp/kc/
Job ID: ebc825b3d1dd43548506ebb7087fd723
Notebooks used for comparisons:
gs://gnomad-kc/notebooks/doubletons.ipynb
gs://gnomad-kc/notebooks/doubletons_debug.ipynb
Cluster config for singleton analyses on 7/7:
hailctl dataproc start mw1 --autoscaling-policy=max-20 --master-machine-type n1-standard-8 --worker-machine-type n1-standard-8 --init gs://gnomad-mwilson/init.sh --max-idle=30m --requester-pays-allow-buckets gnomad-tmp,gnomad-mwlson,gnomad --packages gnomad
Command:
hailctl dataproc submit mw1 singleton_utils.py
Job ID: 38764f697bee47c1a2fbd882973553c9
This job will grab the n singletons per individual in UKBB. Next step is just plotting the distribution.
Will likely have follow-up items after gnomAD meeting; possible next steps could be to use tripletons rather than doubletons and to remove the biallelic SNP filter
Dataset was filtered like the doubleton analysis: high quality autosomal sites and adj entries. Approx median: 29 Mean: 34.56 stdev: 24.78 Min: 0 Max: 411.0 Code: https://github.com/broadinstitute/tgg_methods/pull/47
I removed the bi-allelic SNP filter from the script and reran the analyses. Approx median: 35 Mean: 40.56 StDev: 26.8 Min: 0 Max: 451
Next step is rerunning KCs doubleton analyses with tripletons
Slack thread for singleton distribution: https://atgu.slack.com/archives/CRA2TKTV0/p1657820227511769
Mark asked for a few next plots:
See also the general comments on the plots above: https://atgu.slack.com/archives/CRA2TKTV0/p1657891688786999?thread_ts=1657820227.511769&cid=CRA2TKTV0
I'm hitting an off-heap memory issue when trying to run the n non ref code. I've posted to zulip https://hail.zulipchat.com/#narrow/stream/123010-Hail-Query-0.2E2-support/topic/Hail.20off-heap.20memory.20exceeded.20maximum.20threshold/near/290902183
There was a bug in the code which was causing hail to run out of memory. This is fixed and both samples non ref == 3 and samples non ref == 2 ran using the updated code. The cluster config was pretty meaty because I had jumped to highmem machines in an attempt to get around the OOM errors but it looks like everything was well utilized.
hailctl dataproc start mw1 --autoscaling-policy=max-20 --master-machine-type n1-highmem-8 --worker-machine-type n1-highmem-8 --init gs://gnomad-mwilson/init.sh --requester-pays-allow-buckets gnomad-tmp,gnomad-mwlson,gnomad --no-off-heap-memory --max-idle=30m
Jobs: n non ref ==2 https://console.cloud.google.com/dataproc/jobs/643ca21d749d4caeba145b4795fe90e9/monitoring?region=us-central1&authuser=0&project=broad-mpg-gnomad n non ref == 3 (two jobs because one failed for a silly argument I left in but after a checkpoint so I just restarted with _read_if_exists=True) https://console.cloud.google.com/dataproc/jobs/e87f87fc32e24724bc78457dd543e24d/monitoring?region=us-central1&authuser=0&project=broad-mpg-gnomad https://console.cloud.google.com/dataproc/jobs/b97ee208ba694b30ac3077011a84745b/monitoring?region=us-central1&authuser=0&project=broad-mpg-gnomad Next steps are to join the tables and make Mark's requested plots.
Posted plots that Mark had asked for a couple week ago in the gnomad_qc channel: https://atgu.slack.com/archives/CRA2TKTV0/p1659379153354149?thread_ts=1657820227.511769&cid=CRA2TKTV0
Mark would like to see:
Plotted first two bullet points in the above comment here: https://atgu.slack.com/archives/CRA2TKTV0/p1660597546290599?thread_ts=1657820227.511769&cid=CRA2TKTV0 . Waiting to hear Mark's thoughts on the shift in the distribution for relateds and also asked what plots he would like to see in the third point.
Pinged Mark on this again. We will very likely need to run KING on UKB. KC mentioned theres a MT from the recalculated freqs hanging around containing EUR samples. I could use that to filter to qc sites and then Leo's repo is fairly well documented for prepping a MT and running cuKING.
related to #552, so closing this ticket as well
thank you for flagging this!
Look into why finding doubleton pairs failed to find sample dups/twins