Auto-detection of minimum branch length

xflouris commented 8 years ago

After discussion with Kassian and Paschalia we have concluded on the following possible way of auto-detecting the value that should be used as minimum branch length threshold by the user.

The method is the following:

Get the number of branches of the rooted binary tree (i.e. |leaves| - 1).
Let a be the length of the smallest branch and b the length of the largest branch.
Compute the value n=ceil(log₂(a/b)+1) and create n bins, numbered 1,2,...,n.
Categorize the branch lengths in the bins such that a branch of length i gets in bin j where j=ceil(log₂((a/b)*i)+1).
By the pigeon-hole principle, there must be at least one empty bin.
If there is only one empty bin then set minimum branch length threshold to 0.
In the opposite case, set minimum branch length to 2^i-1a, where i is the first empty bin, i.e. there is no bin j, 1 <= j < i, that is empty.

I've added the description for only logscale 2. I'm not sure now whether we discussed adjustment of the logscale as well, feel free to comment.

This feature should be of low priority at the moment, and should never be forced to the user (or used as default). It could be used to suggest a threshold and give the user an overview of the branch lengths distribution.

stamatak commented 8 years ago

not sure, where does your method come from?

there already exists stuff for handling bimodal distributions, see, e.g.:

https://en.wikipedia.org/wiki/Otsu%27s_method

alexis

On 04.11.2015 16:54, Tomas Flouri wrote:

After discussion with Kassian and Paschalia we have concluded on the following possible way of auto-detecting the value that should be used as minimum branch length threshold by the user.

The method is the following:

Get the number of branches of the rooted binary tree (i.e. /|leaves|

1/).

Let /a/ be the length of the smallest branch and /b/ the length of the largest branch.

Compute the value /n=ceil(log_2 (a/b)+1)/ and create /n/ bins, numbered /1,2,...,n/.

Categorize the branch lengths in the bins such that a branch of length /i/ gets in the bin /j/ where /j=ceil(log_2 ((a/b)*i)+1)/.

By the pigeon-hole principle, there must be at least one empty bin.

If there is only one empty bin then set minimum branch length threshold to /0/.

In the opposite case, set minimum branch length to /2^i-1 a/, where /i/ is the first empty bin, i.e. there is no bin /j/, /1 <= j < i/, that is empty.

I've added the description for only logscale 2. I'm not sure now whether we discussed adjustment of the logscale as well, feel free to comment.

This feature should be of low priority at the moment, and should never be forced to the user (or used as default). It could be used to suggest a threshold and give the user an overview of the branch lengths distribution.

— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/delimitation/issues/47.

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

xflouris commented 8 years ago

Well, our method would be Step 1 for Otsu's method, i.e. it creates a visualization of the branch length distribution.

Now, Otsu's method works on images where the two distributions are clearly visible -- this is not our case because of the logscale. The branch-lengths can have several orders of magnitude difference between them, and so only logscale makes sense for visualizing them all in one plot.

If we were able to clearly visualize the two distributions (also assuming the PTP model is perfect), then Otsu's method would be one algorithm to delimit species under PTP.

About our method - we devised it ourselves, and uses the pigeon-hole principle to generate logscale intervals such that when we categorizing the branch lengths into them, there is at least one empty interval. If there are two or more empty intervals we ignore all branches before the first empty interval. The rest (of the branches) is then used in PTP. If only one gap is found, then we cannot say anything about what the minimum branch length should be.

stamatak commented 8 years ago

I can see your point, however the method does minimize the intra class variance, which is what we want, I am just not convinced that we need some method of our own, rather than using something established, have you talked with the comp. stats guys?

Maybe just sending them some plots would help, to me this really looks more like an optimization problem, in your method, it's a bit unclear what you are optimizing.

Could you send me some plots of the br-lens as well please?

thanks,

alexis

On 06.11.2015 14:07, Tomas Flouri wrote:

Well, our method would be Step 1 for Otsu's method, i.e. it creates a visualization of the branch length distribution.

Now, Otsu's method works on images where the two distributions are clearly visible -- this is not our case because of the logscale. The branch-lengths can have several orders of magnitude difference between them, and so only logscale makes sense for visualizing them all in one plot.

If we were able to clearly visualize the two distributions (also assuming the PTP model is perfect), then Otsu's method would be one algorithm to delimit species under PTP.

About our method - we devised it ourselves, and uses the pigeon-hole principle to generate logscale intervals such that when we categorizing the branch lengths into them, there is at least one empty interval. If there are two or more empty intervals we ignore all branches before the first empty interval. The rest (of the branches) is then used in PTP. If only one gap is found, then we cannot say anything about what the minimum branch length should be.

— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/delimitation/issues/47#issuecomment-154405113.

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

Pas-Kapli commented 8 years ago

Hi Alexi,

after thinking about it a little better we came up with a different approach that will save us from a lot of trouble and explanations.

Our hypothesis was that the branch lengths that cause the problem in PTP are the very small ones that are created randomly among identical sequences. To test it, Tomas implemented a method that given a branch length threshold it returns all subtrees that consists of branch lengths smaller or equal to that threshold. Subsequently, for the taxa of each subtrees it calculates the p-distance (only mismatches among the four nucleotides) of the sequences. After testing it on some of our examples we found out that all these small branch lengths (rightmost cluster in the examples attached) correspond to p-distance=0, which means they are identical.

Given that it is perfectly justifiable to ignore the branch lengths that come from identical sequences, we will add to PTP an auto-detection step that given the alignment and the tree it will calculate the maximum branch length that defines subtrees consisting of identical sequences and set it as min_br. After detecting the min_br it will run the delimitation.

Paschalia

From: Alexis Stamatakis notifications@github.com Sent: 06 November 2015 16:35 To: Pas-Kapli/delimitation Subject: Re: [delimitation] Auto-detection of minimum branch length (#47)

I can see your point, however the method does minimize the intra class variance, which is what we want, I am just not convinced that we need some method of our own, rather than using something established, have you talked with the comp. stats guys?

Maybe just sending them some plots would help, to me this really looks more like an optimization problem, in your method, it's a bit unclear what you are optimizing.

Could you send me some plots of the br-lens as well please?

thanks,

alexis

On 06.11.2015 14:07, Tomas Flouri wrote:

Well, our method would be Step 1 for Otsu's method, i.e. it creates a visualization of the branch length distribution.

Now, Otsu's method works on images where the two distributions are clearly visible -- this is not our case because of the logscale. The branch-lengths can have several orders of magnitude difference between them, and so only logscale makes sense for visualizing them all in one plot.

If we were able to clearly visualize the two distributions (also assuming the PTP model is perfect), then Otsu's method would be one algorithm to delimit species under PTP.

About our method - we devised it ourselves, and uses the pigeon-hole principle to generate logscale intervals such that when we categorizing the branch lengths into them, there is at least one empty interval. If there are two or more empty intervals we ignore all branches before the first empty interval. The rest (of the branches) is then used in PTP. If only one gap is found, then we cannot say anything about what the minimum branch length should be.

Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/delimitation/issues/47#issuecomment-154405113.

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

Reply to this email directly or view it on GitHubhttps://github.com/Pas-Kapli/delimitation/issues/47#issuecomment-154439859.

stamatak commented 8 years ago

Hi Paschalia,

Well this is of course the most simple and cleanest solution, we should implement this then.

Alexis

On 09.11.2015 18:36, Pas-Kapli wrote:

Hi Alexi,

after thinking about it a little better we came up with a different approach that will save us from a lot of trouble and explanations.

Our hypothesis was that the branch lengths that cause the problem in PTP are the very small ones that are created randomly among identical sequences. To test it, Tomas implemented a method that given a branch length threshold it returns all subtrees that consists of branch lengths smaller or equal to that threshold. Subsequently, for the taxa of each subtrees it calculates the p-distance (only mismatches among the four nucleotides) of the sequences. After testing it on some of our examples we found out that all these small branch lengths (rightmost cluster in the examples attached) correspond to p-distance=0, which means they are identical.

Given that it is perfectly justifiable to ignore the branch lengths that come from identical sequences, we will add to PTP an auto-detection step that given the alignment and the tree it will calculate the maximum branch length that defines subtrees consisting of identical sequences and set it as min_br. After detecting the min_br it will run the delimitation.

Paschalia

From: Alexis Stamatakis notifications@github.com Sent: 06 November 2015 16:35 To: Pas-Kapli/delimitation Subject: Re: [delimitation] Auto-detection of minimum branch length (#47)

I can see your point, however the method does minimize the intra class variance, which is what we want, I am just not convinced that we need some method of our own, rather than using something established, have you talked with the comp. stats guys?

Maybe just sending them some plots would help, to me this really looks more like an optimization problem, in your method, it's a bit unclear what you are optimizing.

Could you send me some plots of the br-lens as well please?

thanks,

alexis

On 06.11.2015 14:07, Tomas Flouri wrote:

Well, our method would be Step 1 for Otsu's method, i.e. it creates a visualization of the branch length distribution.

Now, Otsu's method works on images where the two distributions are clearly visible -- this is not our case because of the logscale. The branch-lengths can have several orders of magnitude difference between them, and so only logscale makes sense for visualizing them all in one plot.

If we were able to clearly visualize the two distributions (also assuming the PTP model is perfect), then Otsu's method would be one algorithm to delimit species under PTP.

About our method - we devised it ourselves, and uses the pigeon-hole principle to generate logscale intervals such that when we categorizing the branch lengths into them, there is at least one empty interval. If there are two or more empty intervals we ignore all branches before the first empty interval. The rest (of the branches) is then used in PTP. If only one gap is found, then we cannot say anything about what the minimum branch length should be.

Reply to this email directly or view it on GitHub

https://github.com/Pas-Kapli/delimitation/issues/47#issuecomment-154405113.

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

Reply to this email directly or view it on GitHubhttps://github.com/Pas-Kapli/delimitation/issues/47#issuecomment-154439859.

— Reply to this email directly or view it on GitHub https://github.com/Pas-Kapli/delimitation/issues/47#issuecomment-155134741.

Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson

www.exelixis-lab.org

xflouris commented 8 years ago

started working on this

xflouris commented 8 years ago

finished.

Pas-Kapli / mptp

Auto-detection of minimum branch length #47