Closed stephenturner closed 3 years ago
Hi Stephen, thanks for your interest in yhaplo!
It looks like the file of haplogroup calls made with my old semi-automated method, but based on the same ISOGG version used by yhaplo, has not survived. However, I did find some notes that are essentially an annotated diff.
You can download the 1000 Genomes Y-chromosome VCF linked on page 3 of the manual and run it through yhaplo to get a set of calls. The --all_aux_output option will tell you in detail the basis for each call, which may help for benchmarking. According to these notes, yhaplo matched or bested the older method for 1240/1244 samples. Here are details on the 4 calls for which the semi-automated method yielded a better call:
HG02088 O2b1
has ancestral G at: L1120 A0-T 14496439 G->T
HG02090 Q1a2a1 Q *
has ancestral G at: P36.2 Q1 14496441 G->T
HG01961 Q1a2a1a1 Q *
has ancestral G at: L232 Q1 17516095 G->A
HG02433 I1a2a1a2 I1a2a1 .
has 1-read ancestral A at: Z141 I1a2a1a 7530813 A->G Z141
Note that these lines use YCC notation, since YCC names are easier to compare.
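For benchmarking, a comparison like the one in these notes could be sketched roughly as follows. This is not yhaplo code and not the procedure used for the notes; the function names are hypothetical, and it assumes that in YCC long-form names a descendant haplogroup's name extends its ancestor's name (which does not hold for joint labels such as A0-T or BT). It also assumes the first haplogroup on each line above is the semi-automated call and the second is the yhaplo call.

```python
import re
from collections import Counter


def tokenize(name):
    """Split a YCC long-form name into levels, e.g. 'Q1a2a1' -> ['Q', '1', 'a', '2', 'a', '1']."""
    cleaned = name.replace(" ", "").rstrip("*")  # drop paragroup marker, e.g. 'Q *' -> 'Q'
    return re.findall(r"[A-Z]+|\d+|[a-z]", cleaned)


def compare_calls(reference, candidate):
    """Classify a candidate call against a reference call (both in YCC notation)."""
    ref, cand = tokenize(reference), tokenize(candidate)
    if ref == cand:
        return "match"
    if cand[:len(ref)] == ref:
        return "candidate_deeper"  # candidate resolves further down the same lineage
    if ref[:len(cand)] == cand:
        return "reference_deeper"  # reference resolves further down the same lineage
    return "conflict"              # the two calls sit on different branches


def summarize(reference_calls, candidate_calls):
    """Count comparison outcomes over the samples present in both dicts."""
    shared = reference_calls.keys() & candidate_calls.keys()
    return Counter(
        compare_calls(reference_calls[s], candidate_calls[s]) for s in shared
    )


# Two of the samples listed above (column order is an assumption, see lead-in).
semi_automated = {"HG02090": "Q1a2a1", "HG02433": "I1a2a1a2"}
yhaplo_calls = {"HG02090": "Q *", "HG02433": "I1a2a1"}
print(summarize(semi_automated, yhaplo_calls))
```

Comparing level by level rather than with a plain string prefix avoids treating, say, I1 as an ancestor of I10.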
Hope that helps!
Thanks for the quick reply David. I should mention that one of the reasons I'm looking at other tools is that I have to find something that's permissively licensed (MIT, BSD, GPL, etc.), which is why I was looking for the "gold standard" haplogroup calls based on the semi-automated method you described here. If you've got those somewhere, or even if you've got yhaplo results already compiled for the 1244 1kg samples, that'd be helpful!
Ah, I see. I conducted the same procedure based on an earlier version of the ISOGG database, and I imagine that should suit your purposes. Check out file 3, 3.haplogroups.txt, in the Supplementary Data bundle of this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4884158/
Best, David
Ah, thanks. This looks very similar to the data at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/chrY/1000Y.sample_level_haplogroup_calls.ver5.txt. Same?
Yep; same file, different name.
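For anyone who wants to confirm this locally after downloading both copies, comparing checksums should be enough. A minimal sketch; the helper names are hypothetical and the paths in the usage comment are placeholders for wherever the two downloads were saved.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files are byte-for-byte identical (by SHA-256)."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


# Usage (placeholder paths):
# same_content("3.haplogroups.txt",
#              "1000Y.sample_level_haplogroup_calls.ver5.txt")
```

If the digests differ only because of line endings or a trailing newline, a line-by-line comparison of the parsed calls would be the next thing to try.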
Hi David! Thanks for making this public. I saw the line in the preprint:
Do you have this data anywhere? Could you commit this to the repo? I'm benchmarking some different SNP-based Y haplogrouping tools, and I'm having trouble figuring out what truth data might look like or how to derive it. Thanks!