23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

More clear guidance as to chrY SNP count requirements #5

Closed by bryketos 6 years ago

bryketos commented 6 years ago

Documentation and PDF mention there are ~20k haplogroup-informative SNPs. Documentation and PDF also note that "low coverage genotyping" might not be enough density to make haplogroup calls. Put a precise number on this, or better yet, a tradeoff between genotyping density and accuracy. Is 100 too few? 1,000? 10,000?

dpoznik commented 6 years ago

Thanks for your question. Sorry, I don't see the phrase "low coverage genotyping" in the manuscript, manual, or README.md, so I'm not sure whether you're referring to low-coverage sequencing, which is mentioned in the README, or to a "genotyping platform [that] sparsely covers regions of the tree." Based on the numbers toward the end of your question, I'll assume it's the latter.

Unfortunately, the answer is "it depends." It's less a question of what the algorithm requires and more a question of how granular you want your classifications to be and how diverse your sample is.

You can get a sense of the haplogroup distribution of SNPs in the example input file by first running yhaplo and then looking at the second column of the cleaned-up SNP list:

awk '{ print $2 }' output/isogg.snps.unique.2016.01.04.txt | sort | uniq -c

There are 1721 unique haplogroup labels in the current version of the input file, so a couple thousand well-chosen SNPs would completely cover the tree from this perspective. A few hundred would also do a good, if less granular, job, and a few dozen would be sufficient to assign individuals to major branches of the tree. If the sample of interest were from a particular region of the world, these numbers could be further reduced.
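As a rough way to check these numbers against a given SNP list, here is a minimal sketch in shell. It assumes only what is stated above: that the file is whitespace-delimited and the haplogroup label sits in the second column. The function names are hypothetical, and taking the first character of a label as the "major branch" is a simplifying assumption for illustration, not how yhaplo itself defines branches.

```shell
# Count the distinct haplogroup labels in column 2 of a SNP list.
# Assumption: whitespace-delimited rows, haplogroup label in column 2.
count_haplogroup_labels() {
    awk 'NF >= 2 { print $2 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Tally SNPs per major branch, most SNP-rich branch first.
# Assumption: the first character of the label is used as a crude
# proxy for the major branch (hypothetical simplification).
snps_per_major_branch() {
    awk 'NF >= 2 { print substr($2, 1, 1) }' "$1" | sort | uniq -c | sort -rn
}
```

After running yhaplo, these could be called as `count_haplogroup_labels output/isogg.snps.unique.2016.01.04.txt` and `snps_per_major_branch output/isogg.snps.unique.2016.01.04.txt`. Running the same functions on the subset of SNPs present on a particular array would give a first impression of how much of the tree that platform covers.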

Hope that helps.

bryketos commented 6 years ago

Thanks for the response. I might request a bit more quantitation around this in the future, but in the meantime, adding loose guidelines for commonly used genotyping arrays vs. exome sequencing vs. WGS might be helpful.

Thanks again for the insight. Great software and I've put it to good use in an exciting study. Is it okay to cite your bioRxiv manuscript?

dpoznik commented 6 years ago

Great! I'm glad to hear you've found it useful. Yes, please do cite the bioRxiv manuscript.