brettc / partitionfinder

PartitionFinder discovers optimal partitioning schemes for DNA sequences.
Other
60 stars 42 forks source link

Discuss: TIGER implementations are confusing #103

Closed roblanf closed 7 years ago

roblanf commented 8 years ago

Hey @brettc @pbfrandsen @wrightaprilm,

Here's something to figure out. Right now we have THREE implementations of TIGER, and this is confusing. They are:

  1. Paul's TIGER code in C
  2. Brett's TIGGER code in C
  3. Paul's pure python TIGER code

Right now these can all be called from the commandline with options like --kmeans tiger or --kmeans fast_tiger. It's kinda confusing. My question is, what is the best thing for users here? We don't officially support TIGER for DNA or protein any more (entropy is just as good, and ~infinitely quicker), but TIGER might still be useful for morphology datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only have TIGER work with morphology (these datasets are small, and pure python is OK here). If users call --kmeans tiger from the commandline with DNA or protein, they get back an error.

Pros: v. simple for users, and for us to maintain. Cons: users don't get to do TIGER rates on DNA/Protein any more (do they care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be willing to maintain it if there are bugs, which of course there aren't, and also answer inevitable installation/compilation questions on the forum), and use that for DNA/Protein, but pure python for morphology (so that we avoid problems in the most common use case).

My vote is for Option 1. Because it's the simplest, and doesn't require users to ever compile C code, which is I think a step too far for most users, and will just put people off. We can keep the code in place for running fastTIGER and TIGGER, but just remove the options to do so, and mentions of them in the docs.

Thoughts?

R

brettc commented 8 years ago

I vote for option 1 as well.

The python / cython code for tigger is available in a separate repository here: https://github.com/brettc/tigger in any case. This would be a good starting place for future Tiger/python projects.

A comment though: I think the reason to remove it is that the entropy stuff works better. I don't think a good reason to remove it is because: "users find compiling C hard". We already distribute compiled C code (raxml and phyml). If we didn't have the entropy solution, we'd simply have to distribute some more compiled c code (which, of course, would make our lives more difficult).

I should add that compiling tigger is, in fact, very easy. You just need to right dependencies installed. The hard bit is making sure you have all the right dependencies installed (ie. the reason Anaconda is so good).

Cheers, Brett

On 19 May 2016, 9:55 AM +1200, roblanfnotifications@github.com, wrote:

Hey@brettc(https://github.com/brettc)@pbfrandsen(https://github.com/pbfrandsen)@wrightaprilm(https://github.com/wrightaprilm),

Here's something to figure out. Right now we have THREE implementations of TIGER, and this is confusing. They are:

Paul's TIGER code in C Brett's TIGGER code in C Paul's pure python TIGER code

Right now these can all be called from the commandline with options like--kmeans tiger`or--kmeans fast_tiger``. It's kinda confusing. My question is, what is the best thing for users here? We don't officially support TIGER for DNA or protein any more (entropy is just as good, and ~infinitely quicker), but TIGER might still be useful for morphology datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only have TIGER work with morphology (these datasets are small, and pure python is OK here). If users call --kmeans tiger from the commandline with DNA or protein, they get back an error.

Pros: v. simple for users, and for us to maintain. Cons: users don't get to do TIGER rates on DNA/Protein any more (do they care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be willing to maintain it if there are bugs, which of course there aren't, and also answer inevitable installation/compilation questions on the forum), and use that for DNA/Protein, but pure python for morphology (so that we avoid problems in the most common use case).

My vote is forOption 1. Because it's the simplest, and doesn't require users to ever compile C code, which is I think a step too far for most users, and will just put people off. We can keep the code in place for running fastTIGER and TIGGER, but just remove the options to do so, and mentions of them in the docs.

Thoughts?

R

— You are receiving this because you were mentioned. Reply to this email directly orview it on GitHub(https://github.com/brettc/partitionfinder/issues/103)

roblanf commented 8 years ago

Fair enough.

To continue the argument, I would say that the compiled C code we distribute right now is code that is regularly compiled by 1000's of people on lots of systems, and has dedicated support forums not run by us. My concern is that our C code has not (yet) been widely used, and we'd be doing all the support. Which is something I'd bet none of us are super excited about.

R

On 19 May 2016 at 08:35, Brett Calcott notifications@github.com wrote:

I vote for option 1 as well.

The python / cython code for tigger is available in a separate repository here: https://github.com/brettc/tigger in any case. This would be a good starting place for future Tiger/python projects.

A comment though: I think the reason to remove it is that the entropy stuff works better. I don't think a good reason to remove it is because: "users find compiling C hard". We already distribute compiled C code (raxml and phyml). If we didn't have the entropy solution, we'd simply have to distribute some more compiled c code (which, of course, would make our lives more difficult).

I should add that compiling tigger is, in fact, very easy. You just need to right dependencies installed. The hard bit is making sure you have all the right dependencies installed (ie. the reason Anaconda is so good).

Cheers, Brett

On 19 May 2016, 9:55 AM +1200, roblanfnotifications@github.com, wrote:

Hey@brettc( https://github.com/brettc)@pbfrandsen(https://github.com/pbfrandsen)@wrightaprilm(https://github.com/wrightaprilm ),

Here's something to figure out. Right now we have THREE implementations of TIGER, and this is confusing. They are:

Paul's TIGER code in C Brett's TIGGER code in C Paul's pure python TIGER code

Right now these can all be called from the commandline with options like--kmeans tiger`or--kmeans fast_tiger``. It's kinda confusing. My question is, what is the best thing for users here? We don't officially support TIGER for DNA or protein any more (entropy is just as good, and ~infinitely quicker), but TIGER might still be useful for morphology datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only have TIGER work with morphology (these datasets are small, and pure python is OK here). If users call --kmeans tiger from the commandline with DNA or protein, they get back an error.

Pros: v. simple for users, and for us to maintain. Cons: users don't get to do TIGER rates on DNA/Protein any more (do they care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be willing to maintain it if there are bugs, which of course there aren't, and also answer inevitable installation/compilation questions on the forum), and use that for DNA/Protein, but pure python for morphology (so that we avoid problems in the most common use case).

My vote is forOption 1. Because it's the simplest, and doesn't require users to ever compile C code, which is I think a step too far for most users, and will just put people off. We can keep the code in place for running fastTIGER and TIGGER, but just remove the options to do so, and mentions of them in the docs.

Thoughts?

R

— You are receiving this because you were mentioned. Reply to this email directly orview it on GitHub( https://github.com/brettc/partitionfinder/issues/103)

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/103#issuecomment-220178658

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

brettc commented 8 years ago

Yep!

On 19/05/2016, at 11:22 AM, roblanf notifications@github.com wrote:

Fair enough.

To continue the argument, I would say that the compiled C code we distribute right now is code that is regularly compiled by 1000's of people on lots of systems, and has dedicated support forums not run by us. My concern is that our C code has not (yet) been widely used, and we'd be doing all the support. Which is something I'd bet none of us are super excited about.

R

On 19 May 2016 at 08:35, Brett Calcott notifications@github.com wrote:

I vote for option 1 as well.

The python / cython code for tigger is available in a separate repository here: https://github.com/brettc/tigger in any case. This would be a good starting place for future Tiger/python projects.

A comment though: I think the reason to remove it is that the entropy stuff works better. I don't think a good reason to remove it is because: "users find compiling C hard". We already distribute compiled C code (raxml and phyml). If we didn't have the entropy solution, we'd simply have to distribute some more compiled c code (which, of course, would make our lives more difficult).

I should add that compiling tigger is, in fact, very easy. You just need to right dependencies installed. The hard bit is making sure you have all the right dependencies installed (ie. the reason Anaconda is so good).

Cheers, Brett

On 19 May 2016, 9:55 AM +1200, roblanfnotifications@github.com, wrote:

Hey@brettc( https://github.com/brettc)@pbfrandsen(https://github.com/pbfrandsen)@wrightaprilm(https://github.com/wrightaprilm ),

Here's something to figure out. Right now we have THREE implementations of TIGER, and this is confusing. They are:

Paul's TIGER code in C Brett's TIGGER code in C Paul's pure python TIGER code

Right now these can all be called from the commandline with options like--kmeans tiger`or--kmeans fast_tiger``. It's kinda confusing. My question is, what is the best thing for users here? We don't officially support TIGER for DNA or protein any more (entropy is just as good, and ~infinitely quicker), but TIGER might still be useful for morphology datasets, which are much smaller.

Here are two options, let me know which you prefer, or if there are others you think are better.

Option 1

Delete all C code, retain Paul's pure python implementation, and only have TIGER work with morphology (these datasets are small, and pure python is OK here). If users call --kmeans tiger from the commandline with DNA or protein, they get back an error.

Pros: v. simple for users, and for us to maintain. Cons: users don't get to do TIGER rates on DNA/Protein any more (do they care?)

Option 2

Retain one of the C implementations (in which case Brett/Paul need to be willing to maintain it if there are bugs, which of course there aren't, and also answer inevitable installation/compilation questions on the forum), and use that for DNA/Protein, but pure python for morphology (so that we avoid problems in the most common use case).

My vote is forOption 1. Because it's the simplest, and doesn't require users to ever compile C code, which is I think a step too far for most users, and will just put people off. We can keep the code in place for running fastTIGER and TIGGER, but just remove the options to do so, and mentions of them in the docs.

Thoughts?

R

— You are receiving this because you were mentioned. Reply to this email directly orview it on GitHub( https://github.com/brettc/partitionfinder/issues/103)

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/103#issuecomment-220178658

Rob Lanfear School of Biological Sciences, Macquarie University, Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com — You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/103#issuecomment-220186877

wrightaprilm commented 8 years ago

Option 1. Whatever streamlines maintenance and support.

pbfrandsen commented 8 years ago

I think we definitely need to simplify things with the commands (maybe just --tiger when using morphology?). I would vote to leave the code in for people that really want to use it, but not offer any support.

Paul

On Thu, May 19, 2016 at 11:46 AM April Wright notifications@github.com wrote:

Option 1. Whatever streamlines maintenance and support.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/103#issuecomment-220366457

cmayer commented 8 years ago

I think Paul made a good suggestion. Some people might want to use the code, including Paul.

Best Christoph

Am 20.05.2016 um 14:53 schrieb Paul Frandsen notifications@github.com:

I think we definitely need to simplify things with the commands (maybe just --tiger when using morphology?). I would vote to leave the code in for people that really want to use it, but not offer any support.

Paul

On Thu, May 19, 2016 at 11:46 AM April Wright notifications@github.com wrote:

Option 1. Whatever streamlines maintenance and support.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/brettc/partitionfinder/issues/103#issuecomment-220366457

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub


Dr. Christoph Mayer Email: c.mayer.zfmk@uni-bonn.de Tel.: +49 (0)228 9122 403

Zoologisches Forschungsmuseum Alexander Koenig

Stiftung des öffentlichen Rechts; Direktor: Prof. J. W. Wägele Sitz: Bonn


roblanf commented 8 years ago

Hey @pbfrandsen do you have your pure python TIGER code running and tested yet? If so, can you pull in changes to master, then submit a pull request to master?

Once we have it in there and working, I'll clean up the C code issues as follows:

  1. Remove all C code
  2. Leave the hooks in the Python code so that, in principle, someone (us!) could use TIGGER or fast_Tiger if we wanted.
  3. Set it up so that k-means always works with entropy by default. But that if people want to, then can call --kmeans tiger at the commandline, and this will use your Python implementation.
pbfrandsen commented 8 years ago

Hi Rob,

It's all implemented and running fine. Just need to write a couple of tests. I'm in Panama and taking off on a boat tomorrow, but I'll do my best to write a few tests this week.

Paul

sent from my mobile phone, apologies for any typos On May 31, 2016 8:46 PM, "roblanf" notifications@github.com wrote:

Hey @pbfrandsen https://github.com/pbfrandsen do you have your pure python TIGER code running and tested yet? If so, can you pull in changes to master, then submit a pull request to master?

Once we have it in there and working, I'll clean up the C code issues as follows:

  1. Remove all C code
  2. Leave the hooks in the Python code so that, in principle, someone (us!) could use TIGGER or fast_Tiger if we wanted.
  3. Set it up so that k-means always works with entropy by default. But that if people want to, then can call --kmeans tiger at the commandline, and this will use your Python implementation.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/brettc/partitionfinder/issues/103#issuecomment-222870692, or mute the thread https://github.com/notifications/unsubscribe/ABvScfBY5ZmrzhDqggzeGk2DLZG3tm6Nks5qHOQZgaJpZM4IhvHP .

roblanf commented 7 years ago

Done.

Entropy is now the only option for DNA and protein - TIGER is just too slow, especially since lots of people have datasets with of the order of ~1m sites these days. Also Entropy seems to just work better.

Morphology can use entropy (default) or TIGER ('--kmeans tiger' at the command line).