brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Increasing Number of SNPs improving inference? #95

Open marcustutert opened 2 years ago

marcustutert commented 2 years ago

Hi,

Just a general question: have you ever looked at how well somalier detects relatedness as a function of the number of SNPs in the samples? I noticed that the documentation gives a general description of the algorithm and suggests that the relatedness metrics are well calibrated with only a few tens of SNPs. Would there be any benefit at all (taking into account the increased runtime, I assume) in running somalier with as many shared SNPs as possible between two cohorts?

Thanks.

brentp commented 2 years ago

Hi, I did this for peddy (a predecessor to somalier). You can see that the benefit of increasing the number of sites quickly plateaus: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5339084/figure/fig1/

The benefit of having more sites is that if you have cohorts with targeted regions or spotty coverage, you would still potentially have enough sites.

Note that the site selection in somalier is better than for peddy, so those might plateau even sooner.
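
For intuition about why this plateaus, here is a toy simulation. It is only a sketch under stated assumptions and is not the estimator used by peddy or somalier: it assumes independent sites, a known allele frequency of 0.5 at every site, no genotyping error, and a simple standardized-genotype-product estimator of relatedness. The function names (`simulate_parent_child`, `relatedness`, `spread_by_site_count`) are hypothetical. Under these assumptions the spread of the estimate for a parent-child pair (expected value 0.5) shrinks roughly as 1/sqrt(N), so each additional batch of sites buys less than the one before.

```python
import random
import statistics


def simulate_parent_child(n_sites, rng, maf=0.5):
    """Simulate 0/1/2 genotypes for a parent-child pair at independent sites."""
    parent, child = [], []
    for _ in range(n_sites):
        a1 = rng.random() < maf          # parent's first allele (True = alt)
        a2 = rng.random() < maf          # parent's second allele
        inherited = a1 if rng.random() < 0.5 else a2  # allele passed to child
        other = rng.random() < maf       # allele from the unrelated other parent
        parent.append(int(a1) + int(a2))
        child.append(int(inherited) + int(other))
    return parent, child


def relatedness(g1, g2, maf=0.5):
    """Mean product of standardized genotypes; expectation is 0.5 for parent-child.

    ASSUMPTION: a textbook-style estimator with a known allele frequency,
    not the metric somalier or peddy actually computes.
    """
    denom = 2 * maf * (1 - maf)
    total = sum((a - 2 * maf) * (b - 2 * maf) for a, b in zip(g1, g2))
    return total / (denom * len(g1))


def spread_by_site_count(site_counts, replicates=200, seed=0):
    """Return {n_sites: (mean estimate, standard deviation)} over replicates."""
    rng = random.Random(seed)
    out = {}
    for n in site_counts:
        estimates = [
            relatedness(*simulate_parent_child(n, rng)) for _ in range(replicates)
        ]
        out[n] = (statistics.mean(estimates), statistics.stdev(estimates))
    return out


if __name__ == "__main__":
    for n, (mean, sd) in spread_by_site_count([50, 500, 5000]).items():
        print(f"{n:>5} sites: mean={mean:.3f} sd={sd:.3f}")
```

Real data differ from this toy model (linkage between sites, variable allele frequencies, coverage gaps), which is why having extra sites as a buffer can still help.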

marcustutert commented 2 years ago

Thanks Brent. I think I'll stick with my 1000 SNPs that intersect between the cohorts then. The other option I had was to do something more complicated and take pairwise intersections between my cohorts to maximize the number of SNPs. I've done this with KING and it worked great, but for whatever reason KING likes to have lots and lots of SNPs to estimate relatedness, as opposed to somalier. It seems that I won't have to do this with somalier. Should save me some work!

Cheers.
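
For anyone comparing the two site-selection strategies mentioned above, here is a minimal sketch of the difference. It assumes each cohort's sites have already been loaded into a set of hashable keys such as `(chrom, pos, ref, alt)` tuples; the function names and the example data are hypothetical, and this is not part of somalier or KING.

```python
from itertools import combinations


def shared_sites(cohort_sites):
    """Sites present in every cohort (one global intersection)."""
    sets = [set(sites) for sites in cohort_sites.values()]
    return set.intersection(*sets)


def pairwise_shared_sites(cohort_sites):
    """Sites shared by each pair of cohorts, keyed by the sorted pair of names."""
    return {
        (a, b): set(cohort_sites[a]) & set(cohort_sites[b])
        for a, b in combinations(sorted(cohort_sites), 2)
    }


if __name__ == "__main__":
    # Made-up example data: three cohorts with partly overlapping sites.
    cohorts = {
        "A": {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")},
        "B": {("1", 200, "C", "T"), ("2", 50, "G", "A"), ("3", 10, "T", "C")},
        "C": {("2", 50, "G", "A"), ("3", 10, "T", "C")},
    }
    print(len(shared_sites(cohorts)), "sites shared by all cohorts")
    for pair, sites in pairwise_shared_sites(cohorts).items():
        print(pair, len(sites), "shared sites")
```

The global intersection can only shrink as cohorts are added, while each pairwise intersection is at least as large as the global one, at the cost of maintaining a separate site list per cohort pair.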