caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License

refactor discussion / what I think we need before release #34

Closed wdwvt1 closed 8 years ago

wdwvt1 commented 8 years ago

@gregcaporaso @justin212k @lkursell @johnchase @nscott1 @mjk1000

Based on working with the code this weekend to try to give @johnchase and @lkursell a simple API for launching jobs, I have come to the conclusion that the code has accumulated some cruft. Cruft is technical debt that will come due at some point, so I think we need to consider refactoring. I am not advocating anything major, just a set of changes that @lkursell and @johnchase have already essentially asked for.

Below are my thoughts; please let me know what you think.

The things I think would help, and that I plan on doing once we discuss them and come to agreement:

  1. Refactoring all functions other than the gibbs_sampler to use pd.DataFrames. A significant amount of code in sourcetracker/sourcetracker.py and sourcetracker/cli/gibbs.py exists only to ensure that the arrays representing source data, sink data, and the mapping file data are in the proper order. DataFrames can remove much of this code (@lkursell and @johnchase pointed this out); see the sketch after this list. This could include converting input biom tables to a DataFrame, which would make some operations slightly easier. That part isn't necessary, but it's a possibility for a little extra gain.
  2. Rewriting ConditionalProbability to be clearer and more easily extensible. In our efforts to eke out speed improvements I wrote this class so that it would precompute everything that could be precomputed. The speed gains from this are minor based on my current understanding, and it makes understanding what's happening much less straightforward. In addition, as we move to add other layers of probability (e.g., what @lkursell discussed in his last email), we will likely want to add them in this class. Without refactoring, doing that is going to be hell.
  3. Visualization needs to come with a basic call, I think (though perhaps leaving it in the IPython notebook is okay). It will take very little work to create a set of simple visualizations, and this will make people much more likely to switch to this code.
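
To make point 1 concrete, here is a rough sketch of the kind of alignment code DataFrames would let us delete. The file names and the SourceSink column/values are just the defaults I have in mind, not a final API:

```python
# Rough sketch only: file names and the mapping column/values are
# assumptions, not a settled interface. Index alignment replaces the
# manual ordering checks we currently do on raw arrays.
import pandas as pd
from biom import load_table

table = load_table('table.biom')
counts = pd.DataFrame(table.matrix_data.toarray().T,
                      index=table.ids(axis='sample'),
                      columns=table.ids(axis='observation'))
metadata = pd.read_csv('map.txt', sep='\t', index_col=0)

# Align on sample IDs; both frames come back in the same row order
# without any hand-rolled sorting.
counts, metadata = counts.align(metadata, join='inner', axis=0)

sources = counts[metadata['SourceSink'] == 'source']
sinks = counts[metadata['SourceSink'] == 'sink']
```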

What I need help with

  1. I'd love to get rid of the functools.partial nonsense we are using to enable passing jobs to an ipyparallel client. I don't think it should be necessary, and removing it would get rid of a fair amount of code, but I don't understand why we need it in the first place. If you try to pass a function to the engines that is not wrapped by partial, you get errors indicating that the local namespace doesn't contain the variables you are trying to pass (toy sketch of the pattern below).
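
For reference, this is the pattern I mean; run_sink and the data are toy stand-ins, not the real sampler, and it assumes a running ipyparallel cluster (e.g. started with ipcluster start):

```python
# Toy illustration of the current pattern; run_sink and the arrays are
# stand-ins for the real ST2 sampler and inputs.
import functools
import numpy as np
import ipyparallel as ipp

def run_sink(sink_counts, source_counts, restarts, draws_per_restart):
    # Placeholder for the per-sink Gibbs sampling work.
    return sink_counts.sum() + source_counts.sum() + restarts + draws_per_restart

source_table = np.ones((3, 5))              # toy sources x features table
sink_rows = [np.ones(5) for _ in range(4)]  # toy per-sink count vectors

client = ipp.Client()
lview = client.load_balanced_view()

# Binding the shared arguments here is what keeps the engines from
# complaining that their namespaces lack source_table, restarts, etc.
bound = functools.partial(run_sink, source_counts=source_table,
                          restarts=10, draws_per_restart=1)

async_results = [lview.apply_async(bound, sink) for sink in sink_rows]
print([ar.get() for ar in async_results])
```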

Things I'd like to get a consensus on

  1. Currently we ask each engine in a cluster to operate on a single sink and write the results to an intermediate file. We did this to allow easy recovery from a failure (if you are running ST2 on 500 samples and it quits at sample 485, you'd still have 97% of the work saved), but it means more functions and more intermediate text files. Do people think it would be better to just store data in memory and write out a final result once every computation has completed? @lkursell has suggested that the in-memory approach might be fine, as he has had no mid-run failures in his entire time using this code.
gregcaporaso commented 8 years ago

@wdwvt1, I would be happy to work with you on some of these things to make it go faster.

For the point about temp files, is the issue that they're cluttering an output directory? If so, it might be possible to have them written to a temporary directory instead (using Python's tempfile module). It seems like if you already have failure recovery built in, we might not want to remove it.
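
For example, a minimal sketch of the tempfile idea (toy per-sink results here, not the real output format):

```python
# Toy per-sink results; the real intermediate format may differ.
import os
import tempfile
import pandas as pd

per_sink_results = {
    'sink1': pd.DataFrame({'sourceA': [0.7], 'sourceB': [0.3]}, index=['sink1']),
    'sink2': pd.DataFrame({'sourceA': [0.2], 'sourceB': [0.8]}, index=['sink2']),
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Intermediates go into the temp dir instead of the user's output
    # directory, so they are still available for recovery during the run.
    for sink_id, result in per_sink_results.items():
        result.to_csv(os.path.join(tmpdir, sink_id + '.txt'), sep='\t')

    # On success, assemble the final table; the temp dir (and its files)
    # disappears when this block exits.
    pieces = [pd.read_csv(os.path.join(tmpdir, f), sep='\t', index_col=0)
              for f in sorted(os.listdir(tmpdir))]
    final = pd.concat(pieces)

final.to_csv('mixing_proportions.txt', sep='\t')
```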

You might need the partial stuff - I'd have to spend a little time looking at it to say for sure.

My plan is to spend about 3 hours / month on ST2 maintenance, and I haven't done that yet this month, so I have some free time to work with you on this if you'd like. I could then do the release in early May. I would just check with @ajaykshatriya to see if he'd like to see these changes go in as well.

If we do this, I'd also like to add the option to overwrite the default mapping file column names and values from the CLI.
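
Something along these lines is what I'm picturing, assuming the CLI stays on click; the flag names and defaults are placeholders, not a decided interface:

```python
# Placeholder option names and defaults, not a decided interface.
import click

@click.command()
@click.option('--source-sink-column', default='SourceSink',
              help='Mapping file column that labels sources and sinks.')
@click.option('--source-value', default='source',
              help='Value in that column marking source samples.')
@click.option('--sink-value', default='sink',
              help='Value in that column marking sink samples.')
def gibbs(source_sink_column, source_value, sink_value):
    """Run with user-supplied mapping file conventions."""
    click.echo('Using column %r with values %r/%r' %
               (source_sink_column, source_value, sink_value))

if __name__ == '__main__':
    gibbs()
```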

Finally, all of this would make for easier QIIME 2 integration.

ajaykshatriya commented 8 years ago

Hi Greg: My general view is that spending X hours updating ST2 with new or improved functionality should yield at least 2X hours of increased data analysis efficiency per person over the next 90 days. I defer to the data scientists on whether the aforementioned changes would yield such a boost.

If so, it’s worth your limited and highly-valued time to make these updates.

Best,

Ajay




gregcaporaso commented 8 years ago

Thanks @ajaykshatriya. Let's let @lkursell and @johnchase comment on that.

lkursell commented 8 years ago

Thanks @wdwvt1 for these changes. @johnchase and I are really looking forward to being able to play around with the objects instead of mining text files continuously.

Regarding everything Will mentioned above:

  1. The code at the moment is stable, and the _gibbs call has provided us with exactly what we wanted, so I think we can release what is currently available.
  2. At some point we will need to refactor the ConditionalProbability class to be able to modify sink/source probabilities based on external metadata, like distances, phylogenetic relatedness, etc. With respect to current needs, though, this is likely 2 months off, and we will address it when Will is local in SD.
  3. Same general comments as above. Release ST2 as is; when Will is local we can spend more time walking through these options in relation to our use cases.