23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

Updating ISOGG reference #9

Closed teepean closed 5 years ago

teepean commented 5 years ago

Hello!

I tried updating the ISOGG database to the latest version, but I get a lot of errors caused by conflicting SNPs. What should be done about those, and how do I determine which one should be used?

An example:

ERROR! Conlicting SNPs: FGC29577 I2a1b2a 15472184 C->G Y10712 I2a1b2a 15472184 G->G

In database:

FGC29577    I2a1b2a Y10718  rs7892998   15472184    C->G
Y10712  I2a1b2a FGC29614    rs7892998   15472184    G->G

dpoznik commented 5 years ago

The ISOGG database is a great resource, but it's just a starting point. I imagine it would be a pretty big job to validate the SNPs that have been added since the snapshot currently used by yhaplo. You could use 1000 Genomes data to start. For example, you could identify 1000 Genomes lineages carrying other SNPs in the clade of interest and then assess the allelic distribution for the SNPs of interest. But of course this will only cover SNPs on lineages present in 1000 Genomes.
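The allelic-distribution check described above could be sketched roughly as follows in Python. This assumes you have already extracted the observed allele at the position of interest for each sample, inside and outside the clade (for example from a 1000 Genomes VCF). The helper names `allele_distribution` and `derived_fraction` are hypothetical and not part of yhaplo, and the allele lists are made-up placeholders, not real data.

```python
from collections import Counter

VALID_ALLELES = {"A", "C", "G", "T"}

def allele_distribution(alleles):
    """Count each called allele, ignoring missing or non-ACGT calls."""
    return Counter(a for a in alleles if a in VALID_ALLELES)

def derived_fraction(alleles, derived):
    """Fraction of called samples carrying the putative derived allele."""
    counts = allele_distribution(alleles)
    total = sum(counts.values())
    return counts[derived] / total if total else None

# Placeholder calls at one position (illustrative only, not real data).
in_clade = ["G", "G", "G", "."]          # samples carrying other SNPs of the clade
outside_clade = ["C", "C", "C", "C", "G"]  # samples from other lineages

print(derived_fraction(in_clade, "G"))
print(derived_fraction(outside_clade, "G"))
```

A SNP that truly defines the clade should show the derived allele at a high fraction inside the clade and a low fraction outside it. Where to draw the line is a judgment call that depends on genotyping error and sample size.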

In your specific example, it's pretty clear that G->G is not a valid "ancestral->derived" mutation, so you could either remove that line from your input file or add a line to a blacklist input file (input/isogg.omit.*.txt).
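A minimal Python sketch for finding such rows in bulk. It assumes whitespace-delimited rows whose last field is the mutation written as `ancestral->derived`, as in the two rows quoted in the question. The function name `find_invalid_mutations` is hypothetical and not part of yhaplo.

```python
def find_invalid_mutations(lines):
    """Return rows whose mutation field has identical ancestral and derived alleles."""
    invalid = []
    for line in lines:
        fields = line.split()
        if not fields or "->" not in fields[-1]:
            continue  # skip blank rows and rows without a mutation field
        ancestral, derived = fields[-1].split("->", 1)
        if ancestral == derived:
            invalid.append(line.rstrip("\n"))
    return invalid

# The two rows from the question (layout assumed from that example).
rows = [
    "FGC29577    I2a1b2a Y10718  rs7892998   15472184    C->G",
    "Y10712  I2a1b2a FGC29614    rs7892998   15472184    G->G",
]

for row in find_invalid_mutations(rows):
    # First field is the SNP name; these are candidates to remove or omit.
    print(row.split()[0])
```

The flagged SNP names can then be removed from the input file or listed in the omit file mentioned above. The exact format that file expects is not shown here, so check an existing input/isogg.omit.*.txt file before adding entries.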

The ISOGG version yhaplo currently uses should be sufficient to classify haplogroups to a pretty deep granularity. If you needed greater resolution for any particular subclade of interest, you could build a phylogeny from your sequences. See these references for more details: http://science.sciencemag.org/content/341/6145/562 https://www.nature.com/articles/ng.3559

Hope that helps.