23andMe / yhaplo

Identifying Y-chromosome haplogroups in arbitrarily large samples of sequenced or genotyped men
Other
103 stars 24 forks source link

Enabling yHaplo to be used with incomplete SNP data and investigate haplogroup delineation more easily #6

Closed alexhbnr closed 5 years ago

alexhbnr commented 6 years ago

The changes provide two features for people working with incomplete Y chromosome sequencing data as it is case when using SNP arrays or target-capture approaches and the ones that are interested in exact haplogroup delineation:

  1. It provides the option to specify the minimum number of ancestral alleles to be observed on a path before terminating the search further below it. Until now, it was set as a constant (ancStopThresh = 2), but they often fails the assignment when the haplogroup-specific, derived SNPs are not present in the data set for a certain branch, but the data set only contains derived SNPs for the branch below it. By being able to specify a higher threshold, samples with complete information can be assigned correctly.

  2. yHaplo so far provided the option "haplogroupPaths" that reported the number of derived SNPs found on each branch of Y phylogeny until its leaf. When investigating whether a sample is part of an exisiting haplogroup sub-lineage, it is also interesting to see, which ISOGG SNPs were in fact derived. Currently, this information has to be extracted from derSNPsDetail. The new option "haplogroupPathsDetail" combines the two options by replacing the number of derived SNPs with a comma-separated list of the derived SNP names in the output of the option "haplogroupPaths".

I did not add anything to the manual or README, though, because I think that the options should be rather self-explanatory from the command-line help.

dpoznik commented 5 years ago

Hi Alex, apologies for the delay in my reply.

I'm not sure I understand the motivation for the first change you've described. yhaplo already supports haplogroup calling on incomplete data; it's currently used in production on array data here at 23andMe.

option to specify the minimum number of ancestral alleles to be observed on a path before terminating the search further below it.

The reason this parameter is set in config.py rather than from a command-line argument is that I can't think of a good reason to change it. Please see the docstring of tree.identifyPhylogeneticPath for a description of the stopping condition. In particular, the ancStopThresh parameter is only used in this line:

            if path.node.isLeaf() \
                or (numAncestral  > self.config.ancStopThresh) \
                or (numAncestral == self.config.ancStopThresh and numDerived == 0):

If more than two ancestral alleles are observed on a branch (or if there are exactly two genotyped SNPs and both alleles are ancestral), this is strong evidence that the test lineage does not descend from this branch. If derived alleles are observed on lower branches, that would suggest the presence of uncorrected inconsistencies within in the ISOGG input file. It would be most prudent to correct any such input inconsistencies rather than to accept haplogroup calls descending from branches on which multiple ancestral alleles are present.

often fails the assignment when the haplogroup-specific, derived SNPs are not present in the data set for a certain branch

yhaplo will only curtail a particular search path when it observes ancestral alleles (see above code snippet). In the absence of observed ancestral alleles, yhaplo will explore all paths.

Best, David

alexhbnr commented 5 years ago

Hi David,

I agree with you that the stop parameter regarding the number of ancestral alleles observed at a branch works perfectly fine in 99% of the cases. For modern Y chromosome sequences, I get the correct results when using the default settings without changing the parameter either having whole-genome sequencing data or only a subset of the Y chromosome.

However, there is a certain edge case when it doesn't work correctly, I think. When your sequencing data doesn't cover all ISOGG SNPs but only a subset of it, the problem might occur that you do not have any derived SNPs but only ancestral SNPs for a certain branch but derived SNPs for a descendent branch. This might be the case when the selection of Y chromosome SNPs for capture-enrichment was not equally distributed across the Y haplogroup tree or when ancient DNA samples are only shallowly sequenced.

Here, the question arises how you want to determine haplogroups? By default, y-haplo stops after seeing a branch without any derived SNPs and two ancestral SNPs and terminates the search here. However, if you have incomplete data, I personally prefer to take the extra time and do a more thorough search to avoid having incomplete samples assigned to a node too ancestral given the data. In worst case, incomplete samples are all assigned to node B-T or even A. By allowing to vary the parameter of the number of observed ancestral alleles before terminating the search on this branch, one could circumvent this problem.

I think I am not the only one with this issue: van de Loosdrecht et al. (DOI: 10.1126/science.aar8380) write in their Supplementary Material in section S12:

"To determine the Y chromosome haplogroup of the Taforalt male individuals, we used the yHaplo program (106), which relies on the ancestry-informative markers in the ISOGG (International Society of Genetic Genealogy) data set. (...). The program searches down from the root to the most derived branch of Y haplogroup tree based on derived markers supporting specific branches. Because missing data occurs frequently in ancient DNA data, this automated search often stops before the most derived position on a specific branch. This is not because the data positively support a deep-branching haplogroup but because data is missing for key informative SNPs that define particular nodes/branching points. To prevent this behavior, we manually checked haplogroup support further downstream from the last derived alleles that were detected automatically for each individual. "

If you have a better idea how to prevent this behaviour to happen, I am happy to use it and drop the command-line option to set the parameter of ancestral alleles.

Best, Alex

dpoznik commented 5 years ago

Alex, I appreciate your response. Please accept my apologies for the length of time it's taken me to reply.

To be honest, I still don't quite understand your main point:

the problem might occur that you do not have any derived SNPs but only ancestral SNPs for a certain branch but derived SNPs for a descendant branch

How could you have multiple ancestral alleles on a branch followed by derived alleles on a descending branch? If a lineage descends from branch B, it should carry the derived allele at (almost) all SNPs that define branch B. The main exceptions are (1) back-mutation, which is quite rare; and (2) genotyping error. Is the idea that with ancient DNA the genotype error rate could be extremely high?

I'm not sure I see why high levels of missingness should matter. When yhaplo observes no ancestral or derived alleles, it keeps searching. But 2 or more ancestral alleles is pretty strong evidence that the lineage does not descend from branch B.

However, there is a third exception to the rule that descendants of branch B should carry the derived allele at SNPs that define branch B: (3) the ancestral and derived alleles could be mislabeled in the ISOGG database. This certainly occurs. I've corrected the errors I was able to identify, but there are surely more. When this is the case, I agree that increasing this parameter may help in certain situations. This issue actually came up last week, and I'm curious whether the additional SNP blacklistings in ec9e7ea resolve the issue you were seeing?

Either way, although I like the default values for these parameters, I see no harm in adding the command-line options, especially if you've found tweaking their values to be of use. And I think the other option you added was nice, but I wanted to change the implementation and ordering a bit (39a530c).

Thanks again.

Best, David