Closed wdwvt1 closed 8 years ago
@wdwvt1, I would be happy to work with you on some of these things to make it go faster.
For the point about temp files, is the issue that they're cluttering an output directory? In that case, it might be possible to instead have them written to a temporary directory (using Python's `tempfile` module). It seems like, if you have failure recovery built in already, we might not want to remove it.
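To make the `tempfile` suggestion concrete, here is a minimal sketch of the idea — write intermediate files into a temporary directory that is cleaned up automatically, rather than into the user's output directory. The checkpoint filename and its contents are made up for illustration; they are not actual ST2 artifacts:

```python
import os
import tempfile

# Sketch: keep intermediate/recovery files in a scratch directory instead of
# the output directory. "restart.txt" is a hypothetical checkpoint file.
def run_with_scratch_dir():
    with tempfile.TemporaryDirectory(prefix="st2_") as scratch:
        checkpoint = os.path.join(scratch, "restart.txt")
        with open(checkpoint, "w") as fh:
            fh.write("draw 100 of 1000\n")  # stand-in for real checkpoint data
        # ... a long-running computation could re-read `checkpoint` here
        # to resume after a failure ...
        with open(checkpoint) as fh:
            return fh.read().strip()
    # leaving the `with` block deletes the directory and everything in it

print(run_with_scratch_dir())  # → draw 100 of 1000
```

This keeps the failure-recovery files available for the duration of the run while guaranteeing they never clutter the user-visible output directory.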
You might need the partial stuff - I'd have to spend a little time looking at it to do that.
My plan is to spend about 3 hours / month on ST2 maintenance, and I haven't done that yet this month, so I have some free time to work with you on this if you'd like. I could then do the release in early May. I would just check with @ajaykshatriya to see if he'd like to see these changes go in as well.
If we do this, I'd also like to add the option to override the default mapping file column names and values from the CLI.
Finally, all of this would make for easier QIIME 2 integration.
Hi Greg: My general view is that spending X hours updating ST2 with new or improved functionality should yield at least 2X hours of increased data-analysis efficiency per person over the next 90 days. I defer to the data scientists on whether the aforementioned changes would yield such a boost.
If so, it’s worth your limited and highly-valued time to make these updates.
Best,
Ajay Kshatriya
Chief Executive Officer
Biota Technology, Inc.
+1-650-888-6512
Thanks @ajaykshatriya. Let's let @lkursell and @johnchase comment on that.
Thanks @wdwvt1 for these changes. @johnchase and I are really looking forward to being able to play around with the objects instead of mining text files continuously.
In regards to everything Will mentioned above:

1. The `_gibbs` call has provided us with exactly what we wanted. I think therefore that we can release what is currently available.
2. We would also like the `ConditionalProbability` function to be able to modify sink/source probabilities based on external metadata, like distances, phylogenetic relatedness, etc. However, in regards to current needs, this is likely 2 months off, and we will address it when Will is local in SD.
@gregcaporaso @justin212k @lkursell @johnchase @nscott1 @mjk1000
Based on working with the code this weekend to try to get @johnchase and @lkursell a simple API for launching jobs, I have come to the conclusion that the code has accumulated some cruft. Cruft is technical debt that will come due at some point, so I think we need to consider refactoring. I am not advocating anything major, just a set of changes that @lkursell and @johnchase have already essentially asked for.
Below are my thoughts; please let me know what you think.

**The things I think would help and plan on doing once we modify/come to agreement**

1. Convert `gibbs_sampler` to use `pd.DataFrame`s. A significant amount of code is spent in `sourcetracker/sourcetracker.py`, as well as `sourcetracker/cli/gibbs.py`, ensuring that the arrays representing the source data, sink data, and mapping file data are in the proper order, etc. `DataFrame`s can help remove much of this code (@lkursell and @johnchase pointed this out). This could include converting input `biom` tables to a `DataFrame`, as it would make some operations slightly easier. Not a necessity, but a possibility for a little extra gain.
2. Refactor `ConditionalProbability` to be clearer and more easily extensible. In our efforts to eke out speed improvements, I wrote this class so that it would precompute everything that could be precomputed. The speed gains from this are minor based on my current understanding, and it makes understanding what's happening much less straightforward. In addition, as we move to add other layers of probability (e.g., what @lkursell discussed in his last email), we will likely want to add them in this class. Without refactoring, it's going to be hell trying to do this.

**What I need help with**
1. Removing the `functools.partial` nonsense that we are using to enable passing jobs to an ipyparallel client. I don't think we should need to do this, and removing it would get rid of a fair amount of code, but I don't understand why we need it in the first place. If you try to pass a function to the engines that is not wrapped by `partial`, you get errors indicating the local namespace doesn't contain the variables you are trying to pass.

**Things I'd like to get a consensus on**
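On the `functools.partial` point above: the likely reason is that ipyparallel ships work to engines by serializing it, and a function that merely refers to local variables won't carry those values with it, whereas a `partial` bakes the values into a single self-contained, picklable object. A minimal sketch with made-up function and argument names, using `pickle` to stand in for the engine transport (no ipyparallel required):

```python
import pickle
from functools import partial

# Hypothetical stand-in for the real per-engine work function.
def sample_chunk(table, restarts, chunk_id):
    return (chunk_id, restarts, len(table))

table = [1, 2, 3]

# `partial` bakes the local arguments into one picklable object, which is
# roughly what gets shipped to an ipyparallel engine.
job = partial(sample_chunk, table, 10)
wire = pickle.dumps(job)            # serialize, as the transport would
result = pickle.loads(wire)(0)      # "engine" calls it with its chunk id
print(result)  # → (0, 10, 3)
```

If that is indeed the mechanism, the `partial` wrapping could potentially be replaced by passing the arguments explicitly through the client's `apply`-style interface rather than closing over them, but that would need to be verified against ipyparallel's actual behavior.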
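As an illustration of the `pd.DataFrame` point in item 1 above: label-based alignment lets pandas do the sample-ordering bookkeeping that is currently done by hand. The sample names, feature names, and the `SourceSink` column below are hypothetical, chosen only to mirror the kind of data involved:

```python
import pandas as pd

# A feature table whose samples arrive in arbitrary order.
table = pd.DataFrame(
    [[3, 0], [1, 2], [0, 5]],
    index=["s3", "s1", "s2"],          # samples, out of order
    columns=["otu_a", "otu_b"],
)

# Mapping-file metadata in a different sample order.
metadata = pd.DataFrame(
    {"SourceSink": ["source", "source", "sink"]},
    index=["s1", "s2", "s3"],
)

# One .loc call reorders the table to match the metadata; no manual
# argsort/lookup code needed.
aligned = table.loc[metadata.index]

# Boolean masks also align by label, so selecting sources is one line.
sources = aligned[metadata["SourceSink"] == "source"]
print(list(sources.index))  # → ['s1', 's2']
```

The manual "is row i of the array really sample i of the mapping file?" checks disappear, because every operation carries the sample labels with it.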