brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Relate Killed? #113

Open kevwilhelm95 opened 1 year ago

kevwilhelm95 commented 1 year ago

Hello,

I am trying to run relate on > 400,000 samples. I know this is quite the task, and I believe I have figured out a way to run it by chunking the samples into 6 groups and comparing them head to head. For example, I would compare split 1 to split 2, which would be ~78,000 samples in each group (~156,000 per relate run). I thought this would fix my memory issues; however, when I run Somalier, it still gets killed within seconds. I calculated that I would need at least 42 GB of memory for this, and my machine has 250 GB available. Any other ideas for how to work around this? (A rough sketch of the chunked approach is below.)
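For illustration, here is a minimal sketch of how the chunked head-to-head comparisons could be scripted. It assumes per-sample .somalier files have already been produced by somalier extract, and that somalier relate accepts a list of .somalier files with an -o/--output-prefix flag as shown in the project README; the chunk size, directory names, and output prefixes are all illustrative, not verified.

```python
# Sketch: run `somalier relate` on every pair of sample chunks so that all
# within-chunk and cross-chunk sample pairs are eventually scored.
# Assumes per-sample .somalier files from `somalier extract`; flag names and
# paths are illustrative.
import glob
import itertools
import subprocess

CHUNK_SIZE = 20_000  # smaller chunks, per the suggestion in the reply below

samples = sorted(glob.glob("extracted/*.somalier"))
chunks = [samples[i:i + CHUNK_SIZE] for i in range(0, len(samples), CHUNK_SIZE)]

# Every unordered pair of chunks, including each chunk with itself, so that
# every sample pair is covered exactly once across all runs.
for i, j in itertools.combinations_with_replacement(range(len(chunks)), 2):
    files = chunks[i] if i == j else chunks[i] + chunks[j]
    subprocess.run(
        ["somalier", "relate", "-o", f"relate_chunk{i}_vs_{j}", *files],
        check=True,
    )
```

Note that with tens of thousands of file paths the OS argument-length limit can become a problem, so in practice the file list may need to be passed another way (for example, a shell glob over a per-chunk directory).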

brentp commented 1 year ago

That's a large cohort! What is the exit code when it fails within seconds? I think you might be hitting memory limits. Maybe you can split into smaller groups of, e.g., 20K?

Also, to make sure I understand your splits: you are sending a total of 78K samples per run, is that right?

kevwilhelm95 commented 1 year ago

This is the error output: `b'somalier version: 0.2.16\n[somalier] starting read of 78306 samples\nKilled\n'`. I have run it combining splits (~156,000 samples) and within just one split (78,306 samples). I guess I will just keep splitting until I have enough resources to run it.

kevwilhelm95 commented 1 year ago

@brentp Another question I am curious about: my group has previously used KING to estimate the relatedness of individuals in our cohort, but the cohort has grown so large that we are testing out Somalier. Using KING, we defined 2nd-degree relatives as those with kinship > 0.0884. Does this threshold carry over to the relatedness predictions from Somalier, or is there another threshold we should use?

Thank you for all of your hard work on this

brentp commented 1 year ago

That cutoff should generally be ok. The problem is that if you have 20K samples and a single low-quality sample, that one sample may appear to be related at that level to the other 19,999 samples, so you do have to do some filtering. I use the html plot to decide on reasonable cutoffs. That will be difficult at the size of your cohort, but it should work for 10-20K samples, since the plot sub-samples unrelated samples pretty heavily.
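To make that filtering step concrete, here is a rough sketch (not part of somalier itself) of screening the pairwise text output: keep pairs above the KING-style 0.0884 cutoff, then flag samples that appear in an implausibly large number of such pairs, since those are more likely low-quality than genuinely related to thousands of people. The column names are assumed from somalier's *.pairs.tsv output format, and the partner-count limit is arbitrary; verify both against your data.

```python
# Rough sketch: screen somalier's pairwise text output for putative 2nd-degree
# (or closer) relatives and flag samples that look low-quality because they are
# "related" to an implausible number of others. Column names assumed from the
# somalier *.pairs.tsv output; check them against your version.
import csv
from collections import Counter

CUTOFF = 0.0884        # KING-style 2nd-degree kinship threshold from the question
MAX_PARTNERS = 100     # arbitrary illustrative limit; tune for your cohort

related_pairs = []
partner_counts = Counter()

with open("somalier.pairs.tsv") as fh:
    reader = csv.DictReader(fh, delimiter="\t")
    for row in reader:
        if float(row["relatedness"]) > CUTOFF:
            a, b = row["#sample_a"], row["sample_b"]
            related_pairs.append((a, b, float(row["relatedness"])))
            partner_counts[a] += 1
            partner_counts[b] += 1

# Samples related to very many others are suspect (likely low quality); their
# pairs should probably be set aside before calling true relatives.
suspect = {s for s, n in partner_counts.items() if n > MAX_PARTNERS}
clean_pairs = [p for p in related_pairs if p[0] not in suspect and p[1] not in suspect]

print(f"{len(suspect)} suspect samples, {len(clean_pairs)} retained related pairs")
```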

Another thought: it's most likely dying while generating the JSON for the html, since that uses a lot of memory. But even in that case, you may still get the text output.