23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men

Update to new versions of ISOGG Y-DNA haplogroup tree #3

Closed alexhbnr closed 6 years ago

alexhbnr commented 6 years ago

Hello,

I was wondering whether there are scripts available, or planned for a future release, that would make it possible to update to a newer version of the ISOGG Y-DNA haplogroup tree.

The current version of yhaplo ships with the ISOGG tree dated January 4th, 2016. I would like to use a newer version of the ISOGG Y-DNA haplogroup tree instead, because I noticed some mistakes in the older versions, e.g. swapped ancestral and derived alleles at certain SNPs, that are fixed in newer versions. Did you manually convert the data provided by ISOGG into the input format required by yhaplo, or do you have scripts that you would be willing to share?

dpoznik commented 6 years ago

Alex, thanks for writing. The short answer is that there are no short-term plans to update, but I have a few suggestions that may suit your needs.

First, some background. I constructed input/isogg.2016.01.04.txt by copying a table from a web page ISOGG used to host. yhaplo parses this file, correcting for formatting glitches on the fly and correcting a number of errors I found therein by reading from a few supplementary input files. In particular, in testing, I detected a number of ancestral/derived allele swaps and other such errors. I shared these with the ISOGG folks, and I believe they've made the corresponding corrections on their website. So yhaplo does use the corrected data for any SNPs currently listed in one of the input/isogg.correct.*.txt files.

If your primary interest is to correct additional ancestral/derived allele swaps, that should be easy. You could add any corrections to input/isogg.correct.polarize.txt locally. It would also be easy to systematically compare input/isogg.2016.01.04.txt to ISOGG's current SNP Index and supplement input/isogg.correct.polarize.txt with any polarization changes not already accounted for.
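The comparison described above could be sketched roughly as follows. This is only an illustration, not code from yhaplo: it assumes both the 2016 table and the current SNP Index have already been parsed into dictionaries mapping SNP name to an (ancestral, derived) pair, and the function name, variable names, and SNP names are all hypothetical. Parsing the real files, and writing results in whatever format input/isogg.correct.polarize.txt expects, is left out.

```python
from typing import Dict, FrozenSet, List, Tuple

Alleles = Tuple[str, str]  # (ancestral, derived)


def find_polarization_swaps(
    old: Dict[str, Alleles],
    current: Dict[str, Alleles],
    already_corrected: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return SNP names whose ancestral/derived alleles are exactly swapped
    between the old table and the current one, skipping SNPs that already
    have a correction on file."""
    swaps = []
    for snp, (ancestral, derived) in sorted(old.items()):
        if snp in already_corrected or snp not in current:
            continue
        # A swap means the current table lists the same two alleles reversed.
        if ancestral != derived and current[snp] == (derived, ancestral):
            swaps.append(snp)
    return swaps


# Hypothetical placeholder data; these are not real ISOGG entries.
old_table = {"snpA": ("C", "T"), "snpB": ("G", "A"), "snpC": ("A", "G")}
current_index = {
    "snpA": ("T", "C"),  # swapped relative to the old table
    "snpB": ("G", "A"),  # unchanged
    "snpC": ("G", "A"),  # swapped, but assumed to be corrected already
    "snpD": ("C", "G"),  # new SNP, absent from the old table
}

new_swaps = find_polarization_swaps(
    old_table, current_index, already_corrected=frozenset({"snpC"})
)
print(new_swaps)
```

Only exact reversals are reported here; SNPs whose alleles changed in some other way would need a separate manual look.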

If you want to use an updated version of the tree, that would be a bit more complicated, but still feasible. yhaplo uses the (SNP name)-to-(YCC haplogroup label) mappings in input/isogg.2016.01.04.txt to build out the tree. Unfortunately, ISOGG no longer maintains a table directly linking SNP names to YCC haplogroup labels, but such a table could be constructed as follows:

  1. Parse the ISOGG tree pages to link YCC labels to SNP names. I've actually written an HTML parser (attached) that essentially does this as a side effect, so most of the work on this point is done.
  2. To get coordinates and ancestral/derived states, merge the above with the current ISOGG SNP Index, keying on SNP name. This step should be trivial.
  3. Assess consistency of the updated tree structure and haplogroup labels in test data. This step would be the most time-consuming, and, unfortunately, I don't have the bandwidth for it at the moment.
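Step 2 above might look something like the sketch below. It assumes step 1 has already produced a mapping from YCC haplogroup label to SNP names, and that the SNP Index has been parsed into a mapping from SNP name to position and alleles. All names, types, and data here are hypothetical placeholders rather than anything from yhaplo or ISOGG.

```python
from typing import Dict, List, NamedTuple, Tuple


class SnpRecord(NamedTuple):
    """One SNP Index entry (assumed fields)."""
    position: int
    ancestral: str
    derived: str


class TreeSnp(NamedTuple):
    """One merged row: SNP name, haplogroup label, and SNP Index fields."""
    snp: str
    haplogroup: str
    position: int
    ancestral: str
    derived: str


def merge_tree_with_index(
    label_to_snps: Dict[str, List[str]],
    snp_index: Dict[str, SnpRecord],
) -> Tuple[List[TreeSnp], List[str]]:
    """Join tree-derived (label -> SNP names) with the SNP Index on SNP name.

    Returns the merged rows plus the SNP names that had no index entry,
    so that gaps can be reviewed rather than silently dropped."""
    merged: List[TreeSnp] = []
    missing: List[str] = []
    for haplogroup, snps in label_to_snps.items():
        for snp in snps:
            record = snp_index.get(snp)
            if record is None:
                missing.append(snp)
                continue
            merged.append(TreeSnp(snp, haplogroup, *record))
    return merged, missing


# Hypothetical placeholder data; not real haplogroups, SNPs, or coordinates.
label_to_snps = {"X1": ["snpA", "snpB"], "X1a": ["snpC"]}
snp_index = {
    "snpA": SnpRecord(100, "C", "T"),
    "snpC": SnpRecord(200, "G", "A"),
}

merged_rows, missing_snps = merge_tree_with_index(label_to_snps, snp_index)
for row in merged_rows:
    print(row)
print("missing from index:", missing_snps)
```

Keeping the unmatched SNP names in a separate list gives a starting point for step 3, since SNPs on the tree that lack an index entry are one kind of inconsistency worth checking.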

I hope that helps.

dpoznik commented 6 years ago

P.S. Here is that HTML parser; just remove the .txt extension. goParseIsoggHTML.py.txt

alexhbnr commented 6 years ago

Thanks, David, for the detailed answer. Sorry for not replying sooner; too many unrelated tasks from a paper submission got in my way. I will work through your suggestions in the next few days, but you have already helped me a lot.

Working with the data that ISOGG provides on its website is not user-friendly: all of the data formats are a pain to parse into something readable in Python or R.

Once again, thanks for sharing it!