brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Somalier with pooled parents #136

Open tetedange13 opened 1 month ago

tetedange13 commented 1 month ago

Hi,

First, thanks for developing somalier, it is a great tool!

In my team we have exome data with pooled parents, most of the time 4 mums and 4 dads together => I run somalier directly on BAMs, and I have a few questions, if you do not mind:

Thanks for any help with this! Best regards, Felix.

brentp commented 1 month ago

Hi Felix, do you mean by pooled that all reads from all samples are mixed, without barcodes, so you don't know which reads came from which samples? It's possible that somalier can help here, but it's not designed for that. And certainly, --infer will not work well (if at all) for that case. If your children are sequenced individually, you could look at the rate of IBS0 to the parent pool. That should be very close to 0 if the parent is in the pool, but even that might not be reliable, because if only a single parent has the allele, the ratio will be very low and the site might be called as hom-ref.
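To make the IBS0 idea concrete, here is a minimal sketch of how an IBS0 count between a child and a pool could be computed from genotype calls. It is a simplified illustration, not somalier's implementation: the 0/1/2 genotype coding, the function name `ibs0_count`, and the example genotypes are all assumptions made up for this example.

```python
# Genotypes coded as alternate-allele counts (an assumption for this sketch):
# 0 = hom-ref, 1 = het, 2 = hom-alt, None = missing call.
def ibs0_count(child, pool):
    """Count sites where the two samples share no allele (0 vs 2 or 2 vs 0)."""
    n = 0
    for c, p in zip(child, pool):
        if c is None or p is None:
            continue  # skip sites that are not called in both samples
        if abs(c - p) == 2:
            n += 1
    return n


# Made-up genotypes at six sites, for illustration only.
child = [0, 2, 1, 2, 0, None]
pool = [0, 1, 1, 0, 2, 2]
print(ibs0_count(child, pool))  # 2
```

In a pool, a site where only one parent carries the alternate allele may be called hom-ref, which would inflate this count even when the true parent is in the pool.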

tetedange13 commented 2 weeks ago

Thanks for your quick answer!

Yes, I meant "pooled parents" exactly as you described, and our children are indeed sequenced individually.

For relatedness, IBS0 is indeed a good indicator => A child always has an IBS0 under 20 with its parental pool (versus an IBS0 above 50 with any unrelated pool)

I also found homozygous concordance to be a good metric => "child - pooled_parent" relationships are always above 0.6-0.65 when the parent is indeed in the pool (and lower otherwise) => All "pool to pool" relationships exhibit low IBS0, but they never have a high enough "hom_concord" (so an even better metric than IBS0 in my case?)
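The two thresholds above can be combined into a simple filter over the pairs table. This is only a rough sketch: it assumes a tab-separated table with `sample_a`, `sample_b`, `ibs0` and `hom_concordance` columns and a header line that may start with `#`. The function names and the demo rows are hypothetical, and the cut-offs (IBS0 under 20, homozygous concordance above 0.6) are the ones observed on this dataset, not general recommendations.

```python
def parse_pairs(tsv_text):
    """Parse a tab-separated pairs table into a list of dicts (header may start with '#')."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    header = lines[0].lstrip("#").split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]


def likely_child_pool_pairs(tsv_text, max_ibs0=20, min_hom_concordance=0.6):
    """Return (sample_a, sample_b) pairs passing both dataset-specific thresholds."""
    hits = []
    for row in parse_pairs(tsv_text):
        low_ibs0 = int(row["ibs0"]) < max_ibs0
        high_concordance = float(row["hom_concordance"]) > min_hom_concordance
        if low_ibs0 and high_concordance:
            hits.append((row["sample_a"], row["sample_b"]))
    return hits


# Made-up rows for illustration; real values come from the pairs output.
DEMO_TSV = (
    "#sample_a\tsample_b\tibs0\thom_concordance\n"
    "child1\tpoolA\t12\t0.71\n"
    "child1\tpoolB\t64\t0.41\n"
    "poolA\tpoolB\t8\t0.35\n"
)
print(likely_child_pool_pairs(DEMO_TSV))  # [('child1', 'poolA')]
```

Requiring both conditions excludes the "pool to pool" rows, which have low IBS0 but low homozygous concordance.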

If --ped is the way to go, I would really benefit from being able to have duplicate sampleID in the input PED (on condition that they have different famID) => This would essentially be to get a correct "expected_relatedness" set in "pairs.tsv" => For all possible "child{1,2,3,4} - pooled_parents_1+2+3+4" relationships of a given pool (I hope I am clear enough here)

tetedange13 commented 2 weeks ago

Regarding guessing the number of samples pooled together from the data, I also made some progress (somalier_relate.html is very handy for all that):

| Number of samples in pool | n_hom_ref |
|---|---|
| 1 | > 5000 |
| 2 | ~ 2500 |
| 3 | ~ 1500 |
| 4 | ~ 1000 |

=> I prefer to use the "fraction of hom_alt" (= hom_alt / (hom_alt + het + hom_ref)) => After plotting this fraction against "expected_ploidy", I found a good linear correlation => int(-12.5 * frac_hom_alt + 5.3) gives a rounded estimate of the number of samples in the pool
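As a small sketch, the linear fit above can be wrapped in a helper. The function name `estimate_pool_size` and the example counts are made up for illustration. The coefficients (-12.5 and 5.3) were fitted on this particular exome dataset, so they should be treated as dataset-specific rather than general.

```python
def estimate_pool_size(n_hom_ref, n_het, n_hom_alt):
    """Estimate the number of pooled samples from the fraction of hom-alt calls."""
    total = n_hom_ref + n_het + n_hom_alt
    if total == 0:
        raise ValueError("no genotyped sites")
    frac_hom_alt = n_hom_alt / total
    # int() truncates toward zero rather than rounding to the nearest integer.
    return int(-12.5 * frac_hom_alt + 5.3)


# Made-up counts: a hom-alt fraction of 0.10 gives 4, a fraction of 0.30 gives 1.
print(estimate_pool_size(7000, 2000, 1000))  # 4
print(estimate_pool_size(5000, 2000, 3000))  # 1
```

Because `int()` truncates, replacing it with `round()` would shift some estimates near the boundaries, so the choice should match how the fit was calibrated.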

Thanks again! Best regards, Felix.