Gold standard 1000 genomes haplogroups

stephenturner commented 3 years ago

Hi David! Thanks for making this public. I saw the line in the preprint:

We constructed a gold-standard set of haplogroup calls using a semi-automated method [3,7] and an up-to-date version of the ISOGG database.

Do you have this data anywhere? Could you commit this to the repo? I'm benchmarking some different SNP-based Y haplogrouping tools, and I'm having trouble figuring out what truth data might look like or how to derive it. Thanks!

dpoznik commented 3 years ago

Hi Stephen, thanks for your interest in yhaplo!

It looks like the file of haplogroup calls using my old semi-automated method, but based on the same ISOGG version used by yhaplo, has not survived. However, I did find some notes that are essentially an annotated diff.

You can download the 1000 Genomes Y-chromosome VCF linked on page 3 of the manual and run it through yhaplo to get a set of calls. The --all_aux_output option will tell you in detail the basis for each call, which may help for benchmarking. According to these notes, yhaplo matched or bested the older method for 1240/1244 samples. Here are details on the 4 calls for which the semi-automated method yielded a better call:

HG02088  O2b1
    has ancestral G at: L1120  A0-T  14496439  G->T
HG02090  Q1a2a1    Q       *
    has ancestral G at: P36.2  Q1  14496441  G->T
HG01961  Q1a2a1a1  Q       *
    has ancestral G at: L232  Q1  17516095  G->A
HG02433  I1a2a1a2  I1a2a1  .
    has 1-read ancestral A at: Z141  I1a2a1a  7530813  A->G  Z141

Note that these lines use YCC notation, since those names are easier to compare.
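For benchmarking, a comparison along these lines can be sketched in Python. This is only an illustration, not part of yhaplo: the function names `tokenize`, `compare_calls`, and `summarize` are hypothetical, and it assumes that in YCC notation a haplogroup whose name is a leading portion of another name is an ancestor of it (so Q is less specific than Q1a2a1). The sample values are taken from three of the lines above, assuming the first haplogroup column is the semi-automated call and the second is the yhaplo call.

```python
import re
from collections import Counter


def tokenize(name):
    """Split a YCC-style name into tokens, keeping multi-digit numbers whole."""
    return re.findall(r"\d+|\D", name)


def compare_calls(truth, candidate):
    """Classify two YCC-style calls by their prefix relationship (assumed hierarchy)."""
    t, c = tokenize(truth), tokenize(candidate)
    if t == c:
        return "match"
    if t[: len(c)] == c:
        return "truth_more_specific"      # candidate is an ancestor of truth
    if c[: len(t)] == t:
        return "candidate_more_specific"  # truth is an ancestor of candidate
    return "conflict"                     # the calls sit on different branches


def summarize(truth_calls, candidate_calls):
    """Count each category over the samples present in both call sets."""
    shared = truth_calls.keys() & candidate_calls.keys()
    return Counter(
        compare_calls(truth_calls[s], candidate_calls[s]) for s in shared
    )


# Values copied from the notes above; the column roles are an assumption.
semi_automated = {"HG02090": "Q1a2a1", "HG01961": "Q1a2a1a1", "HG02433": "I1a2a1a2"}
yhaplo_calls = {"HG02090": "Q", "HG01961": "Q", "HG02433": "I1a2a1"}

print(summarize(semi_automated, yhaplo_calls))
```

Tokenizing before comparing avoids treating a name ending in 1 as an ancestor of one ending in 10. Names carrying suffixes such as * or ~ would need to be normalized first.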

Hope that helps!

stephenturner commented 3 years ago

Thanks for the quick reply David. I should mention that one of the reasons I'm looking around at other tools is I have to find something that's permissively licensed - MIT, BSD, GPL, etc, which was why I was looking for the "gold standard" haplogroup calls based on the semi-automated method you described here. If you've got those somewhere or even if you've got yhaplo results compiled already for the 1244 1kg samples that'd be helpful!

dpoznik commented 3 years ago

Ah, I see. I conducted the same procedure based on an earlier version of the ISOGG database, and I imagine that should suit your purposes. Check out file 3, 3.haplogroups.txt, in the Supplementary Data bundle of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4884158/

Best, David

stephenturner commented 3 years ago

Ah, thanks. This looks very similar to the data at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/chrY/1000Y.sample_level_haplogroup_calls.ver5.txt. Same?

dpoznik commented 3 years ago

Yep; same file, different name.

23andMe / yhaplo

Gold standard 1000 genomes haplogroups #16