caporaso-lab / sourcetracker2

SourceTracker2
BSD 3-Clause "New" or "Revised" License
62 stars 45 forks source link

Major overhaul of internals #51

Closed wdwvt1 closed 8 years ago

wdwvt1 commented 8 years ago

This PR overhauls the ST2 code to do several important things:

  1. Adds full table output - the actual contingency table that each draw from the Gibb's sampler generates is now returned by the (API function) _gibbs and _gibbs_loo.
  2. Adds _gibbs_loo method - this is the API for leave-one-out calls and can now be run in parallel in the same manner that _gibbs can.
  3. Update internals to use pandas - much of the code dealing with e.g. nonoverlapping metadata and feature sample sets has been moved to simple pandas operations.

There are a variety of bug fixes as well (no more nans, checks for fractional count tables, etc.)

The test examples work and the test code passes, but this is a significant update so if ya'll could give a thorough review that would be much appreciated. @lkursell @johnchase @gregcaporaso

gregcaporaso commented 8 years ago

@wdwvt1, ok if I get to this next week? I won't have time tomorrow.

wdwvt1 commented 8 years ago

@gregcaporaso - yeah, absolutely.

lkursell commented 8 years ago

@wdwvt1 I took a pass, but we need @gregcaporaso for the main review

gregcaporaso commented 8 years ago

Sorry guys, I'll review on Monday at the latest.

gregcaporaso commented 8 years ago

@wdwvt1, done with my review. Can you confirm that all new functions that you created have corresponding tests? For example, I don't see tests for get_samples? It's probably worth running coveralls on the code base too to get an idea of what is/isn't being tested.

wdwvt1 commented 8 years ago

@gregcaporaso - thanks for the review!

Two high level thoughts:

  1. I don't want to depend on QIIME2. I don't think there will be a gain in functionality, and the code maintenance burden (so far less than 15 lines of code) is not high. The support burden of QIIME2 will be higher I think -- QIIME2 will change rapidly as it gets fleshed out if scikit bio, biom format, and QIIME 1 are any guide. I won't have much of a say in those decisions so I'll have to accommodate the decisions downstream. I can just avoid that all together by dealing with these things here.
  2. The documentation for the API vs the CLI is starting to diverge significantly I think. Should I make two sections in the README.md for these?

Minor thoughts:

  1. I use underscores in particular functions to indicate things I am going to operate on/do more work on before returning something out of the function. Most of these underscores seem intuitive to me (feature_table -> _ft). I think I prefer short names just because I am used to one-character variables. I'll change these to improve the readability.
  2. I will add a description of what is required for the mapping file if using the CLI.
gregcaporaso commented 8 years ago

I don't want to depend on QIIME2.

Makes sense, no problem. It is all minor stuff.

The documentation for the API vs the CLI

We don't really have a public API for this yet (just a candidate) so I think that's premature, but if you want to, go for it.

I use underscores in particular functions to indicate things

I think it'd be better to not create this convention here - the leading underscore usually indicates private functionality, so it's going to be confusing to other developers who want to contribute to this. This is getting into style though, so take this as my opinion (i.e., not required for merge, just a strong suggestion).

I will add a description of what is required for the mapping file

Sounds good.

Let me know when you're ready for a re-review. Also, I think I replied to all of the questions you had - let me know if I missed any.

wdwvt1 commented 8 years ago

Thanks @gregcaporaso - I will work on these over the next two days and then ping you for re-review.

wdwvt1 commented 8 years ago

@gregcaporaso @johnchase @lkursell have addressed all the comments, and made a couple improvements to type checking etc.

If you want to take another pass that'd be great.

This commit removes the intermediate file writing - everything is stored in memory now. This is much easier to handle in a lot of cases, but in some of my large simulations I have been seeing segmentation faults. Can you guys please use this branch in your analyses and tell me if you are getting them?

wdwvt1 commented 8 years ago

Simulations with a large data set revealed a heinous memory leak: 5 minutes of run time with a large enough table would end up requesting 80gb of memory. The source of the memory leak appeared to be a cyclic reference in the Sampler class. In addition, storing the full table as output during the gibbs_sampler runs was causing significant memory allocation problems.

To resolve these problems I eliminated the Sampler class, and rewrote gibbs_sampler so that it produces only the vectors needed for calculating the proportions and the full output (but does not do it itself).

Benchmarks with this PR show that you can now do a ST2 run of 100 sinks, 350 sources, 8000 features, 10 restarts, 10 draws/restart, 100 burnins, in 30 minutes when using 6 cores, without exceeding 1gb of total memory use.

@gregcaporaso can you please review? @johnchase and @lkursell - your review would also be appreciated if you have time. @nscott1 - this PR will resolve the unreasonable runtime problems you were seeing.

This PR now solves issues #55 #58 #13 #21 #52 #41

wdwvt1 commented 8 years ago

@wasade - thanks for taking a look! The np.int32 calls were because I was worried about memory usage. The memory usage was was not related to them at all, but I ended up leaving those in there. I am happy to remove them if you think that its poor practice/unnecessary. The only failure case I could see is if someone had counts that were greater than a np.int32 could handle. In that case, ST2 is just not an appropriate tool - there would never be enough burnins for that to be reasonably well assigned. Perhaps that bears mentioning in the documentation/paper as we write it.

You are absolutely write about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in the updating the p_tv * p_v and in drawing a value from that with np.choice. Your email about sampling from a discrete distribution might lead to an improvement here as well.

wasade commented 8 years ago

Ah, makes sense with int32. It shouldn't be a problem to keep.

Do you have a profile to share? I'd be happy to help with some cythonizing if you want to go that route.

On Aug 20, 2016 17:42, "Will Van Treuren" notifications@github.com wrote:

@wasade https://github.com/wasade - thanks for taking a look! The np.int32 calls were because I was worried about memory usage. The memory usage was was not related to them at all, but I ended up leaving those in there. I am happy to remove them if you think that its poor practice/unnecessary. The only failure case I could see is if someone had counts that were greater than a np.int32 could handle. In that case, ST2 is just not an appropriate tool - there would never be enough burnins for that to be reasonably well assigned. Perhaps that bears mentioning in the documentation/paper as we write it.

You are absolutely write about the cythonization. The vast majority of code runtime is consumed in the inner loop (primarily in the updating the p_tv

  • p_v and in drawing a value from that with np.choice. Your email about sampling from a discrete distribution might lead to an improvement here as well.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/51#issuecomment-241232118, or mute the thread https://github.com/notifications/unsubscribe-auth/AAc8sn8U_up3-yYH-EJKUPg_iowiugY_ks5qh58BgaJpZM4JNBOB .

gregcaporaso commented 8 years ago

@wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.

Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.

Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.

wdwvt1 commented 8 years ago

Thanks - I'll fix the documentation, typos, and camel case.

Have fixed previous comments except for the mapping file loading thing/issue you opened. Haven't seen loading a mapping file give an error and the test code in skbio doesn't have an example of a mapping file that triggers the errors the function is supposed to catch (that I saw). Is there an example of a mapping file that fails? The linked code also doesn't make sense to me, is it catching a case with leading comments?

I'm mainly interested in a review of the changes to the internal code, specifically the elimination of the 'Sampler' class and the removal of the proportion calculation from 'gibbs_sampler'.

On Aug 22, 2016 5:22 PM, "Greg Caporaso" notifications@github.com wrote:

@wdwvt1 https://github.com/wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.

Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.

Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/51#issuecomment-241591129, or mute the thread https://github.com/notifications/unsubscribe-auth/AA__-Uu61QkUEfIawaD_SFI9tC-LDD5Bks5qij04gaJpZM4JNBOB .

wdwvt1 commented 8 years ago

Also yes to addition of tests.

Coverage testing sounds good, though I think if I had to choose between that and more eyes on the internals of the algorithm (comparison of Dan's sourcetracker internals to our 'gibbs_sampler') I'd choose the latter.

On Aug 22, 2016 5:55 PM, "Will Van Treuren" wdwvt1@gmail.com wrote:

Thanks - I'll fix the documentation, typos, and camel case.

Have fixed previous comments except for the mapping file loading thing/issue you opened. Haven't seen loading a mapping file give an error and the test code in skbio doesn't have an example of a mapping file that triggers the errors the function is supposed to catch (that I saw). Is there an example of a mapping file that fails? The linked code also doesn't make sense to me, is it catching a case with leading comments?

I'm mainly interested in a review of the changes to the internal code, specifically the elimination of the 'Sampler' class and the removal of the proportion calculation from 'gibbs_sampler'.

On Aug 22, 2016 5:22 PM, "Greg Caporaso" notifications@github.com wrote:

@wdwvt1 https://github.com/wdwvt1 I spent some time on this today and it's looking reasonable. I didn't spend as much time as I did on my first pass, but if there are specific areas that you want me to look at in more detail I can do that. Just let me know what those are. I'd like to look through this a little more tomorrow - in particular I want to spend more time reviewing the test code.

Have you gone through the previous comments to make sure that you've addressed everything? It's hard to tell since so much changed since my last review.

Also, have you confirmed that all new methods and new functionality now have tests? I think it might be worth starting to compute test coverage on this repo. I started experimenting with this in another pull request.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/biota/sourcetracker2/pull/51#issuecomment-241591129, or mute the thread https://github.com/notifications/unsubscribe-auth/AA__-Uu61QkUEfIawaD_SFI9tC-LDD5Bks5qij04gaJpZM4JNBOB .

gregcaporaso commented 8 years ago

more eyes on the internals of the algorithm

Can you link me to the implementation that you'd like me to compare against? I'm not very familiar with that code. I can do that in addition to looking at the tests.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+6.09%) to 87.912% when pulling e551825e09f3aa4c8e691de430e867e181da3a28 on wdwvt1:pd_internals into 3927d148aa27756b1ed1bdeb01a06428f8949940 on biota:master.

coveralls commented 8 years ago

Coverage Status

Coverage increased (+6.09%) to 87.912% when pulling 07f67e404b39ab8bf0880ce8cea635fc44442f52 on wdwvt1:pd_internals into 3927d148aa27756b1ed1bdeb01a06428f8949940 on biota:master.

gregcaporaso commented 8 years ago

Thanks @wdwvt1! As discussed on our call today, we can deal with the remaining items that were brought up here through the new issues that we created.