brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.

Discuss: make an ML starting tree the default #106

Closed roblanf closed 7 years ago

roblanf commented 8 years ago

Hi all (plus @wrightaprilm - not sure if you get notifications),

Right now PF2 has the option to use an ML starting tree (via --ml-tree on the command line). If you don't do this, you get an NJ tree with PhyML, or an MP tree with RAxML.

It strikes me that the ML tree (which is calculated in RAxML with the fast tree option) should be the default. It's obviously much more likely to be a good tree. On top of that, it's just as quick to calculate (don't ask) as the MP tree RAxML otherwise uses. This is partly because I multithread the calculation of the tree, so even for large datasets we can estimate it really fast.

So, does anyone have any good reasons why I shouldn't make the ML tree the default, and leave a switch on the command line (e.g. --no-ml-tree) to turn it off if you don't want it?

R

pbfrandsen commented 8 years ago

I think it makes sense to make the ML tree default. I really can't envision a scenario where a parsimony or neighbor joining tree would be preferred (especially when they aren't faster!).

The only issue (that I don't think should be an issue) is that if people have problems with the pre-compiled binaries and they wish to use only PhyML, they will have to compile RAxML too.

Paul

cmayer commented 8 years ago

Hi Rob,

for 98% of the users I think it should be the better choice.

What happens if the user wants the old behavior, e.g. because preceding analyses have been done with an older version? How long does a quick ML tree take for data sets of the size of 1Kite? Will there be an option to get the old behavior?

Apart from this, I favor the idea of using an ML tree. I was never 100% convinced that we can be sure the MP tree does not introduce a bias. Then again, if the quick ML tree is not identical to the true tree, the same problem could occur.

Best Christoph

Christoph Mayer Forschungsmuseum Alexander Koenig Bonn Email c.mayer.zfmk@uni-bonn.de Tel.: 0228 9122403

roblanf commented 8 years ago

Thanks guys.

I will go ahead and make it the default. My guess is that if there is any bias from the starting tree, it will be less serious when we use a better starting tree.

@cmayer, I'll see how long the ML tree takes on a 1Kite-sized dataset, and report back.

I'll also put in an option to revert to the old behaviour, for cases where the ML tree is too slow (e.g. for 1Kite perhaps) or users don't want to or can't use RAxML (Paul's point).

R

roblanf commented 8 years ago

@cmayer, the ML tree takes ~12 hours on the Misof et al. dataset from Science (the biggest one in the repository). This is pretty decent I think, and given that the analyses typically run for many days, I don't think 12 hours is bad.

Either way, it can be switched off. I'm going to go ahead and make this the default.
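As a rough illustration of the behaviour described in this thread (not PartitionFinder's actual code), a default-on ML starting tree with an opt-out switch could be sketched in Python as below. The `--no-ml-tree` name is the one suggested above; the `--raxml` switch, the function names, and the returned labels are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a small parser where the ML starting tree is on by default."""
    parser = argparse.ArgumentParser(
        description="Sketch of a default-on ML starting tree option"
    )
    # Assumed switch for choosing RAxML over PhyML (illustrative only).
    parser.add_argument(
        "--raxml",
        action="store_true",
        help="use RAxML instead of PhyML",
    )
    # Opt-out switch: ml_tree stays True unless --no-ml-tree is given.
    parser.add_argument(
        "--no-ml-tree",
        dest="ml_tree",
        action="store_false",
        default=True,
        help="revert to the old NJ (PhyML) or MP (RAxML) starting tree",
    )
    return parser


def starting_tree_method(ml_tree, use_raxml):
    """Return a label for the starting tree, following the thread's description."""
    if ml_tree:
        return "ML"  # fast ML tree, estimated with RAxML
    # Old behaviour: MP tree with RAxML, NJ tree with PhyML.
    return "MP" if use_raxml else "NJ"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(starting_tree_method(args.ml_tree, args.raxml))
```

With no arguments the sketch selects the ML tree; passing `--no-ml-tree` falls back to NJ, or to MP when `--raxml` is also given.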

cmayer commented 8 years ago

Hi Rob,

I agree. I also checked this with a data set having more than 0.8 million aa sites and more than 190 taxa. The fast tree topology needed about 24 hours on 24 cores. This is reasonable and only contributes a small proportion to the total time of the analysis.

Best Christoph

roblanf commented 8 years ago

You guys have ridiculously big datasets!!!

Glad it worked OK though.

R

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

roblanf commented 7 years ago

done.