brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License
254 stars 35 forks source link

Matching Called VCF and Uncalled CRAMs #100

Open marcustutert opened 2 years ago

marcustutert commented 2 years ago

Hi Brent,

Been having trouble with using SOMALIER to apply to two different cohorts I want to confirm as a validation set for how SOMALIER performs to identify true duplicates between them. At the moment my two cohorts are a set of 22k jointly called (roughly 100k SNPs which passed QC) WES samples and a set of 22k CRAMs that match 1:1 to each of these called sequences. I've run my somalier extract command on the two datasets as outlined in the documentation and ended up with ~20k variants extracted in the called VCF cohort.

However I am running the SOMALIER relate command now and not seeing anything come up as a match (>0.9 relatedness). Is there something I am fundamentally doing wrong here?

Would appreciate your help!

Best, Marcus

brentp commented 2 years ago

Can you share the pairs.tsv for a single pair that should be identical?

marcustutert commented 2 years ago

Here's a screenshot, sorry rsync is being funny with my cluster atm:

Screenshot 2022-03-18 at 1 11 29 pm Screenshot 2022-03-18 at 1 11 34 pm
brentp commented 2 years ago

Are these low coverage or low quality? Do the cram samples cluster separately from the VCF samples? Can the html output give you any insight into what's going on? For truly related samples, obviously, you should have very few sites with IBS0 if that's not the case, there's not much to discern related pairs.

marcustutert commented 2 years ago

These are high quality and high coverage samples. The HTML output and the clustering suggests that there is relatedness here (I think?) but that the relatedness metric in the output is not providing much help.

For instance, if I take few of the extracted VCF somalier files and the matched extracted CRAM files I see a clustering of the true matched samples (in the upper right corner of the plot here) if instead of looking at relatedness I look at a measure if IBS2/n

Does this give you any more thoughts to what could be going wrong here:

Screenshot 2022-03-18 at 1 58 11 pm

EDIT: now running through all 20k by 20k pairwise comparisons (CRAM vs VCF samples) its clear that this cluster is a real effect and not just random noise:

Everything above 0.7 on the y-axis is a true matched sample (CRAM sample matched with identical VCF sample)

Screenshot 2022-03-18 at 2 33 21 pm
marcustutert commented 2 years ago

OK I think where I'm at Brent at the moment is that I am convinced that for my use-case (identifying duplicates between cohorts of called and uncalled data) that using a measure of ibs2/n at some threshold (eg. probably around 0.75 based on the examples I've played with) will be good enough to match the duplicates between the CRAM and VCF and that due to the inflation of ibs0 calls that the relatedness metric performs poorly.

However this presents a bit of a computational problem for me as you can see in the plot above and others that there isn't a minimum relatedness metric I can filter on (besides maybe >0 but that will only filter at most 50% of the pairs based on an eye test) when running SOMALIER relate.

Consequently if I wanted to use the ibs2/n threshold after generating the pairs.tsv I would effectively need to generate all possible pairs of samples and in my case since my total cohort size numbers in the ~100k range, that means 10 billion matches. Waiting for that to output and also the storage I have are going to be bottlenecks that effectively make this an unworkable solution, at least to brute force it. There are some clever things I could do in terms of submitting jobs to the cluster etc and coming up with an inventive engineering solution.

Instead I was hoping I could ask you a massive favour (and you've already been so helpful so i hope this is the very last thing I have to ask!) if you could whip up a binary for me that will generate a pairs.tsv file with an additional column that computes ibs2/n and also an option (similar to one you've provided with the relatedness metric) in the SOMALIER relate command to filter the output based on the ibs2/n being above some value (eg. 0.75)

If that is too difficult I'm happy to keep talking about a solution to this problem given the software at hand but I think the above really should solve all my problems!

Best wishes, Marcus

brentp commented 2 years ago

I don't want to add this because it's not really a general thing. There's a problem in your data (or perhaps in the software) that's causing related samples to have high IBS0. That's why the relatedness is not working. If you can share the somalier files for 2 related samples with a low relation score, then I can have a look.