brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

TIGER rates #19

Closed roblanf closed 9 years ago

roblanf commented 9 years ago

So right now the TIGER K-means keeps recalculating TIGER rates.

I am not sure this is good for two reasons:

  1. It is slow, particularly in the early stages of analysing large datasets.
  2. I wonder whether, in some cases, it could fall into the same kind of overfitting trap.

So I'm wondering if we should instead calculate the TIGER rates once, at the top of the algorithm, and then stick with them. Thoughts?

pbfrandsen commented 9 years ago

I'm also a little bit wary of continually re-estimating TIGER rates, for similar reasons to those you pointed out.

However, there is one instance where I think it is a nice feature, and that is when we use the subsampling function of fast_TIGER. Preliminary tests suggest that the subsampled rates are pretty good approximations of the fully sampled rates, but occasionally a site is assigned a very different value when subsampled versus fully sampled. In these cases it usually gets sorted out downstream, since the subsampling ceases once the alignments come down to a smaller size (currently hard-coded at 500,000 sites).

If we do calculate rates just once, we should think about how to do it (especially for huge alignments). If we continue with the iterative splitting, it would probably take less time, since when we just use the bisection search and optimize k, we effectively have to re-estimate likelihoods for the entire alignment (albeit broken up into subsets) at each step.

cmayer commented 9 years ago

Hi all,

The reason the TIGER rates are computed again and again for the same site is that the rate of a site depends on the subset it is located in: the TIGER rate is essentially a mean compatibility to all other sites in the same subset. As the subset changes, there is no exact alternative to recomputing the TIGER rates. The subsampling is the only approximation I think is valid, and it is not perfect. One could store the cross-compatibilities of sites in huge matrices, which would avoid most of the recomputation, but it could also require huge amounts of storage. I would have to think about that.
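To make the caching idea concrete, here is a minimal Python sketch of storing the cross-compatibilities once and re-averaging them per subset. All names are hypothetical, and the pairwise score is an illustrative stand-in, not the real TIGER score:

```python
import numpy as np

def compatibility(col_i, col_j):
    # Illustrative stand-in only: fraction of taxa on which the two
    # columns make the same "same as the first taxon" call. The real
    # TIGER score is defined differently.
    return float(np.mean((col_i == col_i[0]) == (col_j == col_j[0])))

def build_compatibility_matrix(alignment):
    """Full n x n cross-compatibility matrix, computed once.

    For n sites this costs O(n^2) time and storage, which is the
    'huge matrices' concern for large alignments.
    """
    n = alignment.shape[1]
    C = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            C[i, j] = C[j, i] = compatibility(alignment[:, i], alignment[:, j])
    return C

def tiger_rates(C, subset):
    """Rate of each site in `subset`: mean compatibility to all *other*
    sites in the same subset, so rates change whenever the subset does.
    Re-averaging rows of the cached matrix avoids recomputing pairwise
    scores. Undefined for a subset of size 1 (division by zero),
    matching the caveat below."""
    idx = np.asarray(subset)
    sub = C[np.ix_(idx, idx)]
    return (sub.sum(axis=1) - np.diag(sub)) / (len(idx) - 1)
```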

Next problem: Computing likelihoods/site likelihoods can fail if the subset is small.

For TIGER rates I do not expect this to happen, at least not for subsets larger than a single site. With two sites the algorithm should not fail, and the number will certainly be between 0 and 1. I am not sure that value will make much sense, but there will be no indication of a failure.

Only if we want to compute the AIC or BIC for a subset do we need to compute the likelihood, I think. Here we could theoretically run into a problem, and I think we should catch it in the code without aborting the program; we should not simply stop the computation and report the best scheme found so far.

I would favour assigning such a subset a likelihood or BIC that guarantees it cannot be part of the best scheme. As a result, we will continue to look for better schemes that do not contain the subset that caused the problem. AIC or BIC values for subsets that are too small simply do not make sense. If we simply stop, we might report an inferior scheme and confuse the user with a warning for which there is no workaround: the user will start the program again and run into the same problem. The best solution is to ensure that such subsets can never make it into a valid scheme.

Even if we expect this problem to be very rare, I would handle it, since people will run the program millions of times in the coming years, and that increases the chance that everything that can possibly happen will happen one day.
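For concreteness, a minimal sketch of the fallback I have in mind (the function names are hypothetical, not existing PartitionFinder code):

```python
def subset_score(subset, score_fn):
    """Score a subset (e.g. its BIC), but never abort the whole search.

    If the likelihood computation fails (e.g. because the subset is too
    small), return +infinity so that no scheme containing this subset
    can ever be selected as the best scheme; the search just moves on.
    """
    try:
        return score_fn(subset)
    except (ArithmeticError, ValueError) as err:
        # Warn, don't stop: hypothetical logging, not PartitionFinder's.
        print("warning: could not score subset (%s); penalising it" % err)
        return float("inf")
```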

Best Christoph


roblanf commented 9 years ago

Perhaps the best thing to do here is to try a few TIGER approaches on the set of test datasets we have. I appreciate that in theory the meaning of the TIGER rates depends on recalculating them, but empirically it might be OK not to do so.

E.g. we could try three levels of subsetting (e.g. no subsetting, medium subsetting, and heavy subsetting), and do all of these with iteratively calculated rates as well as rates calculated only once at the start.

We could then set PF's default to whatever we think will be best in most cases, and provide command-line controls to invoke the other options. That would be best both for our own testing and for the program we release.

I will try and get some additional UCE datasets (and perhaps published anchor-tagged datasets) to try it on. I'll also see if I can get a huge server to run all of this before February (but that might be a challenge I can't meet...).

R

brettc commented 9 years ago

Note that if storage is a problem, the current database we're using is very fast and efficient (compressed HDF5). I already plan to cache more of our calculations during our work in Feb (so that restarts are faster). The Python library makes this very painless to implement.
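For instance, caching a per-site rates array with h5py takes only a few lines; the file name and keys below are made up for illustration:

```python
import h5py
import numpy as np

def cache_rates(path, key, rates):
    """Store a per-site rate array under `key`, gzip-compressed."""
    with h5py.File(path, "a") as f:
        if key in f:
            del f[key]  # replace a stale entry
        f.create_dataset(key, data=np.asarray(rates), compression="gzip")

def load_rates(path, key):
    """Return the cached array, or None if it hasn't been computed yet."""
    with h5py.File(path, "a") as f:
        return f[key][...] if key in f else None

# Usage on restart: compute only what isn't cached yet.
# rates = load_rates("analysis.h5", "tiger/subset_0042")
# if rates is None:
#     rates = expensive_rate_calculation(...)  # hypothetical
#     cache_rates("analysis.h5", "tiger/subset_0042", rates)
```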

Brett


roblanf commented 9 years ago

Cool, so let's see if we can implement a caching solution at the workshop.

R


pbfrandsen commented 9 years ago

The tests sound good. I am a newly minted admin on a pretty big computer here, so I can run a few of the tests. One thing that would really help is parallelizing the code within the fast_TIGER program; then we can figure out how to make it work with the rest of the parallelization going on within PF. The initial rate calculations will benefit most from this.
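Roughly the kind of site-level parallelism I have in mind, sketched with Python's multiprocessing; the pairwise score is a stand-in, not real fast_TIGER code:

```python
from multiprocessing import Pool
import numpy as np

def compatibility(col_i, col_j):
    # Stand-in pairwise score, for illustration only.
    return float(np.mean((col_i == col_i[0]) == (col_j == col_j[0])))

def rate_for_site(args):
    # Mean compatibility of one site against the comparison pool
    # (self-comparison ignored here for brevity).
    site, pool_sites = args
    return sum(compatibility(site, p) for p in pool_sites) / len(pool_sites)

def initial_rates(sites, pool_sites, n_workers=8):
    # Each site's rate is independent of the others, so the big initial
    # whole-alignment pass parallelizes trivially across processes.
    with Pool(n_workers) as workers:
        return workers.map(rate_for_site, [(s, pool_sites) for s in sites])

if __name__ == "__main__":
    rng = np.random.RandomState(1)
    cols = [rng.randint(0, 4, size=30) for _ in range(200)]  # 200 sites, 30 taxa
    print(initial_rates(cols, cols[:50])[:5])
```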

Paul


pbfrandsen commented 9 years ago

I have iterative k-means running on the cluster here, and I can run a few of the tests that you described above. Do you have a particular dataset (or datasets) in mind that you'd like me to run? This is the test case I am considering:

  1. Three levels of subsampling, e.g. subsampled pool of 1,000 sites, pool of 10,000 sites, and pool of 100,000 sites.
  2. Both iterative rate estimation and bisection search.

I'll report both AICc scores and run times for each configuration.

pbfrandsen commented 9 years ago

The first round of tests is mostly finished; the test without any subsampling is still running, and I will update this thread once it completes. Here are the results from both the iterative search (TIGER rates recalculated for each subset) and the bisection search (TIGER rates calculated once). I tested the McCormack_2013b dataset (30 taxa; 539,526 sites) with three different subsamples: 1,000 sites, 10,000 sites, and 100,000 sites.

Method      Subsample   BIC           Time (hh:mm:ss)   Subsets
iterative   1,000       2720645.842   2:57:34           21
iterative   10,000      2719334.182   4:06:17           23
iterative   100,000     2721810.284   16:33:35          31
bisection   1,000       2736486.384   1:42:58           20
bisection   10,000      2736976.034   2:52:42           32
bisection   100,000     2736832.381   6:47:04           33

As you might notice, in this dataset the BIC doesn't seem to improve with a larger subsample. The BIC is always better when iterative rates are used, but the bisection search was roughly twice as fast as the iterative search.

I will update this once the non-subsampled runs complete.

cmayer commented 9 years ago

Hi Paul,

Do you have the BIC value for the unpartitioned dataset? The trend is certainly what I expected, since the whole TIGER-rates approach is based on the rates being defined as a compatibility score within the subset. Over the course of the partitioning I expect the rates of each site to change dramatically.

When simplifying a method, my first two thoughts are: (a) how do I explain this in the paper, and (b) what would I say as a reviewer?

For (a), I would have to say the approximation is crude and was made only to speed up the computation.

For (b), I would try to convince the editor that this simplification completely breaks with the underlying philosophy of the approach, and that the hope of getting similar results from this approximation seems unjustified to me.

That said, it is good that we have the results. Thanks, Paul.

Interesting: for larger subsamples the result can get worse again.

Best Christoph


cmayer commented 9 years ago

Hi Paul,

I am a bit confused about what the subsamples mean. When I try to correlate them with the scores and the run times of the algorithm, I am not sure I understand them correctly.

Best Christoph


pbfrandsen commented 9 years ago

Since the TIGER rate calculation requires comparing each site against every other site, it is O(n^2) and takes an inordinate amount of time for large alignments. Because of this, I implemented subsampling: we generate a smaller pool of sites (in this test 1,000; 10,000; and 100,000 sites) and compare each site against that smaller pool, reducing the time complexity to roughly O(n).

For this test, when the subset size falls below 50,000 sites (or 200,000 in the case of the 100,000-site subsample), we revert to the original algorithm, because the number of sites to be compared is then tractable.
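In sketch form, the whole thing looks roughly like this (illustrative Python; the rule of twice the pool size, floored at 50,000, is inferred from the numbers above):

```python
import random

def subsampled_rates(sites, pool_size, compat):
    """Approximate each site's rate against a fixed random pool of sites.

    Exact all-against-all comparison is O(n^2) in the number of sites n;
    comparing each site against a fixed pool of `pool_size` sites is
    O(n * pool_size), i.e. roughly linear in n.
    """
    # Revert to the exact algorithm once the subset is small enough:
    # 50,000 sites, or 200,000 for the 100,000-site pool.
    threshold = max(50000, 2 * pool_size)
    pool = sites if len(sites) <= threshold else random.sample(sites, pool_size)
    rates = []
    for site in sites:
        others = [p for p in pool if p is not site]  # skip self-comparison
        rates.append(sum(compat(site, o) for o in others) / len(others))
    return rates
```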

pbfrandsen commented 9 years ago

I'm not sure that the unpartitioned BIC score is relevant here, since we are just testing how different subsample sizes perform. Nevertheless, I pulled it from the beginning of one of the runs: if the alignment is left unpartitioned, the BIC score is 3182261.2.

pbfrandsen commented 9 years ago

Ok, a whole round of tests is complete for one UCE dataset. Obviously we can't make too many sweeping generalizations from one dataset, but I think this test has been interesting.

Here is a table of the AICc scores for the different levels of subsampling (along with no subsampling). It is interesting that both algorithms returned their best AICc score with the smallest subsample.

Method      Subsample   AICc         Time (d:hh:mm:ss)   Subsets
iterative   1,000       2716840.20   2:43:31             51
iterative   10,000      2717020.69   3:51:31             51
iterative   100,000     2718101.42   12:33:41            38
iterative   none        2718090.99   1:09:10:22          39
bisection   1,000       2732890.78   3:24:29             80
bisection   10,000      2733682.12   3:36:07             64
bisection   100,000     2734571.77   6:14:45             48
bisection   none        2733142.28   1:02:57:37          96

Based on this test, my current suggestion is to keep iterative rate estimation as the default.
roblanf commented 9 years ago

Not doing this: the test results are pretty compelling, so let's keep recalculating rates.