brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

generate likelihoods for "fabricated subsets" in the kmeans algorithm #18

Closed pbfrandsen closed 9 years ago

pbfrandsen commented 9 years ago

The "fabricated subsets" feature requires that some sort of BIC score be assigned to subsets that we cannot analyze. To do this we must estimate the log likelihood for the subset as a whole. Because a fabricated subset is, by definition, one that RAxML/PhyML cannot analyze, we don't have its log likelihood. In the first version of kmeans, we simply added up the site log likelihoods that we had conveniently generated for the clustering step, i.e.:

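[image: screen shot 2015-01-16 at 8 59 46 am] https://cloud.githubusercontent.com/assets/1823345/5777222/3b48a4f0-9d5e-11e4-9312-b202e70426a3.png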
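For readers who cannot see the screenshot, here is a minimal sketch of the idea described above: sum the per-site log likelihoods to get a subset log likelihood, then turn that into a BIC score. The function name and arguments are hypothetical, and using the number of sites as the BIC sample size is an assumption, not something stated in this thread.

```python
import math


def bic_from_site_lnls(site_lnls, n_params):
    """Hypothetical helper: BIC for a subset from its per-site log likelihoods.

    Assumes the subset log likelihood is the sum of the site log likelihoods
    and that the sample size is the number of sites in the subset.
    """
    lnl = sum(site_lnls)          # subset log likelihood
    n_sites = len(site_lnls)      # assumed sample size
    # Standard BIC definition: -2 lnL + k ln(n); lower is better.
    return -2.0 * lnl + n_params * math.log(n_sites)
```

This only works when the per-site values really are log likelihoods estimated under a model.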

This is no longer viable since we now use TIGER site rates rather than site likelihoods.

Shall we:

  1. Get rid of the fabricated subset function altogether and throw an error if the subsets get too small? Since we switched to TIGER rates, I haven't yet seen a dataset that required fabricated subsets.
  2. Keep the BIC of the unsplit subset and set the BIC of the problematic subset to a value that makes it, plus the other new subset, one BIC point better than the unsplit subset, so that the algorithm keeps going?
  3. Other ideas?
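To make option 2 concrete, here is a rough sketch of the arithmetic, assuming the split produces one analysable subset and one problematic subset, and that lower BIC is better. The function and parameter names are made up for illustration.

```python
def fabricated_subset_bic(unsplit_bic, analysable_bic, margin=1.0):
    """Hypothetical helper for option 2.

    Returns the BIC to assign to the problematic subset so that
    (analysable_bic + returned value) is `margin` points lower, i.e. better,
    than the BIC of the unsplit subset.
    """
    return unsplit_bic - margin - analysable_bic
```

For example, with an unsplit BIC of 1000 and an analysable subset BIC of 600, the problematic subset would be assigned 399, so the split totals 999.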

roblanf commented 9 years ago

I'm not sure I quite get this.

Assuming we can analyse the entire dataset (i.e. all as a single subset) we will have site likelihoods for all sites. Thus, for every new split, we will have likelihoods from a prior split that we can use, if one member of the split cannot be analysed, right?

So, can't we just carry on as we are? I.e. take the site likelihoods from the prior split if a subset can't be analysed?

As to the other proposed options: option 1 might be OK, but it does mean that we can't help anyone whose dataset generates a subset that can't be analysed. As datasets get bigger and bigger, this will become more and more of a problem. Also, there appear to be thousands of people using PartitionFinder (>10K downloads, and likely ~500 citations this year), so it's likely that this could be a problem for some. If we can have a built-in solution, I think we should.

I don't like option 2: too arbitrary.


Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

pbfrandsen commented 9 years ago

In the likelihood rates version, we did a separate run of PhyML/RAxML to estimate GTR+G site likelihoods/rates, then parsed them for the clustering step. We no longer do this since we are using TIGER rates. So what looks like site likelihoods from before the split are actually just TIGER site rates. Adding them up doesn't mean anything since the rates are just a relative number dependent upon the other sites in the alignment.

An additional option would be to print out and parse site likelihoods during the likelihood calculation in the BIC evaluation step, but when I've tested it, this has sometimes proven to be buggy with RAxML.

roblanf commented 9 years ago

Ah, OK, I see.

Yes. Avoiding calculating site likelihoods would be good, since it's a pain, involves parsing outputs and doubling up on things.

So, how about this: if we hit a subset we can't analyse, we just stop and output the best scheme found so far. That way people still get a best scheme, which is likely to be better than anything else they got. We should also include an informational message in the output that states what happened.
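A rough sketch of that stop-early idea, under stated assumptions: `SubsetUnanalysable`, `search_until_unanalysable`, and the `score_scheme` callback are hypothetical names, not PartitionFinder's actual API, and lower scores (e.g. BIC) are assumed to be better.

```python
import logging

log = logging.getLogger(__name__)


class SubsetUnanalysable(Exception):
    """Hypothetical error raised when a subset cannot be analysed."""


def search_until_unanalysable(candidate_schemes, score_scheme):
    """Score schemes in order; stop at the first one that cannot be analysed.

    Returns (best_scheme, best_score, stopped_early).
    """
    best_scheme, best_score = None, float("inf")
    stopped_early = False
    for scheme in candidate_schemes:
        try:
            score = score_scheme(scheme)
        except SubsetUnanalysable as err:
            # Tell the user what happened, then keep the best scheme so far.
            log.warning("Stopping early, a subset could not be analysed: %s", err)
            stopped_early = True
            break
        if score < best_score:
            best_scheme, best_score = scheme, score
    return best_scheme, best_score, stopped_early
```

The `stopped_early` flag lets the caller report in the output that the search ended because of an unanalysable subset.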


roblanf commented 9 years ago

Done, by dealing with everything on a subset-by-subset basis. We don't build schemes during kmeans any more.

Fabricated subsets get parked and dealt with at the end of the algorithm.

It's pretty neat, actually.
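For illustration only, the "parking" step could look something like the sketch below. This is not the code that was committed: `score_subsets` and `analyse_subset` are hypothetical names, `analyse_subset` is assumed to return `None` for a subset it cannot analyse, and how parked subsets are handled at the end of the algorithm is not described in this thread, so the sketch just returns them.

```python
def score_subsets(subsets, analyse_subset):
    """Score each subset independently; park the ones that cannot be analysed.

    `subsets` is an iterable of hashable subsets (e.g. tuples of site indices).
    `analyse_subset` is assumed to return a BIC score, or None on failure.
    Returns (scored, parked): a dict of subset -> BIC and a list of parked subsets.
    """
    scored = {}
    parked = []
    for subset in subsets:
        bic = analyse_subset(subset)
        if bic is None:
            parked.append(subset)   # set aside, to be dealt with later
        else:
            scored[subset] = bic
    return scored, parked
```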