Closed wdwvt1 closed 8 years ago
@wdwvt1, ok if I get to this next week? I won't have time tomorrow.
@gregcaporaso - yeah, absolutely.
@wdwvt1 I took a pass, but we need @gregcaporaso for the main review
Sorry guys, I'll review on Monday at the latest.
@wdwvt1, done with my review. Can you confirm that all new functions that you created have corresponding tests? For example, I don't see tests for get_samples
? It's probably worth running coveralls on the code base too to get an idea of what is/isn't being tested.
@gregcaporaso - thanks for the review!
Two high level thoughts:
scikit bio
, biom format
, and QIIME 1
are any guide. I won't have much of a say in those decisions so I'll have to accommodate the decisions downstream. I can just avoid that all together by dealing with these things here.README.md
for these?Minor thoughts:
feature_table
-> _ft
). I think I prefer short names just because I am used to one-character variables. I'll change these to improve the readability.I don't want to depend on QIIME2.
Makes sense, no problem. It is all minor stuff.
The documentation for the API vs the CLI
We don't really have a public API for this yet (just a candidate) so I think that's premature, but if you want to, go for it.
I use underscores in particular functions to indicate things
I think it'd be better to not create this convention here - the leading underscore usually indicates private functionality, so it's going to be confusing to other developers who want to contribute to this. This is getting into style though, so take this as my opinion (i.e., not required for merge, just a strong suggestion).
I will add a description of what is required for the mapping file
Sounds good.
Let me know when you're ready for a re-review. Also, I think I replied to all of the questions you had - let me know if I missed any.
Thanks @gregcaporaso - I will work on these over the next two days and then ping you for re-review.
@gregcaporaso @johnchase @lkursell have addressed all the comments, and made a couple improvements to type checking etc.
If you want to take another pass that'd be great.
This commit removes the intermediate file writing - everything is stored in memory now. This is much easier to handle in a lot of cases, but in some of my large simulations I have been seeing segmentation faults. Can you guys please use this branch in your analyses and tell me if you are getting them?
Simulations with a large data set revealed a heinous memory leak: 5 minutes of run time with a large enough table would end up requesting 80gb of memory. The source of the memory leak appeared to be a cyclic reference in the Sampler
class. In addition, storing the full table as output during the gibbs_sampler
runs was causing significant memory allocation problems.
To resolve these problems I eliminated the Sampler
class, and rewrote gibbs_sampler
so that it produces only the vectors needed for calculating the proportions and the full output (but does not do it itself).
Benchmarks with this PR show that you can now do a ST2 run of 100 sinks, 350 sources, 8000 features, 10 restarts, 10 draws/restart, 100 burnins, in 30 minutes when using 6 cores, without exceeding 1gb of total memory use.
@gregcaporaso can you please review? @johnchase and @lkursell - your review would also be appreciated if you have time. @nscott1 - this PR will resolve the unreasonable runtime problems you were seeing.
This PR now solves issues #55 #58 #13 #21 #52 #41
@wasade - thanks for taking a look! The np.int32
calls were because I was worried about memory usage. The memory usage was was not related to them at all, but I ended up leaving those in there. I am happy to remove them if you think that its poor practice/unnecessary. The only failure case I could see is if someone had counts that were greater than a np.int32
could handle. In that case, ST2 is just not an appropriate tool - there would never be enough burnins for that to be reasonably well assigned. Perhaps that bears mentioning in the documentation/paper as we write it.
You are absolutely write about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in the updating the p_tv * p_v
and in drawing a value from that with np.choice
. Your email about sampling from a discrete distribution might lead to an improvement here as well.
Ah, makes sense with int32. It shouldn't be a problem to keep.
Do you have a profile to share? I'd be happy to help with some cythonizing if you want to go that route.
On Aug 20, 2016 17:42, "Will Van Treuren" notifications@github.com wrote:
@wasade https://github.com/wasade - thanks for taking a look! The np.int32 calls were because I was worried about memory usage. The memory usage was was not related to them at all, but I ended up leaving those in there. I am happy to remove them if you think that its poor practice/unnecessary. The only failure case I could see is if someone had counts that were greater than a np.int32 could handle. In that case, ST2 is just not an appropriate tool - there would never be enough burnins for that to be reasonably well assigned. Perhaps that bears mentioning in the documentation/paper as we write it.
You are absolutely write about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in the updating the p_tv
- p_v and in drawing a value from that with np.choice. Your email about sampling from a discrete distribution might lead to an improvement here as well.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/51#issuecomment-241232118, or mute the thread https://github.com/notifications/unsubscribe-auth/AAc8sn8U_up3-yYH-EJKUPg_iowiugY_ks5qh58BgaJpZM4JNBOB .
@wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.
Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.
Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.
Thanks - I'll fix the documentation, typos, and camel case.
Have fixed previous comments except for the mapping file loading thing/issue you opened. Haven't seen loading a mapping file give an error and the test code in skbio doesn't have an example of a mapping file that triggers the errors the function is supposed to catch (that I saw). Is there an example of a mapping file that fails? The linked code also doesn't make sense to me, is it catching a case with leading comments?
I'm mainly interested in a review of the changes to the internal code, specifically the elimination of the 'Sampler' class and the removal of the proportion calculation from 'gibbs_sampler'.
On Aug 22, 2016 5:22 PM, "Greg Caporaso" notifications@github.com wrote:
@wdwvt1 https://github.com/wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.
Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.
Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/51#issuecomment-241591129, or mute the thread https://github.com/notifications/unsubscribe-auth/AA__-Uu61QkUEfIawaD_SFI9tC-LDD5Bks5qij04gaJpZM4JNBOB .
Also yes to addition of tests.
Coverage testing sounds good, though I think if I had to choose between that and more eyes on the internals of the algorithm (comparison of Dan's sourcetracker internals to our 'gibbs_sampler') I'd choose the latter.
On Aug 22, 2016 5:55 PM, "Will Van Treuren" wdwvt1@gmail.com wrote:
Thanks - I'll fix the documentation, typos, and camel case.
Have fixed previous comments except for the mapping file loading thing/issue you opened. Haven't seen loading a mapping file give an error and the test code in skbio doesn't have an example of a mapping file that triggers the errors the function is supposed to catch (that I saw). Is there an example of a mapping file that fails? The linked code also doesn't make sense to me, is it catching a case with leading comments?
I'm mainly interested in a review of the changes to the internal code, specifically the elimination of the 'Sampler' class and the removal of the proportion calculation from 'gibbs_sampler'.
On Aug 22, 2016 5:22 PM, "Greg Caporaso" notifications@github.com wrote:
@wdwvt1 https://github.com/wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.
Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.
Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/51#issuecomment-241591129, or mute the thread https://github.com/notifications/unsubscribe-auth/AA__-Uu61QkUEfIawaD_SFI9tC-LDD5Bks5qij04gaJpZM4JNBOB .
more eyes on the internals of the algorithm
Can you link me to the implementation that you'd like me to compare against? I'm not very familiar with that code. I can do that in addition to looking at the tests.
Thanks @wdwvt1! As discussed on our call today, we can deal with the remaining items that were brought up here through the new issues that we created.
This PR overhauls the ST2 code to do several important things:
_gibbs
and_gibbs_loo
._gibbs_loo
method - this is the API for leave-one-out calls and can now be run in parallel in the same manner that_gibbs
can.pandas
- much of the code dealing with e.g. nonoverlapping metadata and feature sample sets has been moved to simplepandas
operations.There are a variety of bug fixes as well (no more nans, checks for fractional count tables, etc.)
The test examples work and the test code passes, but this is a significant update so if ya'll could give a thorough review that would be much appreciated. @lkursell @johnchase @gregcaporaso