caporaso-lab / sourcetracker2


Investigate memoization for ConditionalProbability.calculate_cp_slice #26

Closed gregcaporaso closed 7 years ago

gregcaporaso commented 8 years ago

From @wdwvt1 on January 5, 2016 16:59

@justin212k had an excellent suggestion to use memoization for storing calls in some of the inner loops. Based on discussion with @justin212k and @gregcaporaso, we think that memoization should occur on the function ConditionalProbability.calculate_cp_slice, since this function is deterministic and merely sets the probabilities so that np.random.choice can draw from them.

We can also use functools.lru_cache in Python 3 to inspect the number of hits to the cache, which will give us a good estimate of what cache size to use, how much we can save with the cache, etc.
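A minimal sketch of how this could look, assuming the deterministic calculation can be exposed through a wrapper that takes hashable arguments (the function below is a toy stand-in, not the real calculate_cp_slice, whose signature may differ):

```python
from functools import lru_cache

import numpy as np


@lru_cache(maxsize=2048)  # power-of-two size, per the functools docs
def cached_cp_slice(counts_key):
    """Hypothetical cached wrapper around a deterministic probability calc.

    lru_cache keys the cache on argument values, so the caller must pass
    something hashable (a tuple) rather than a numpy array.
    """
    counts = np.asarray(counts_key, dtype=float)
    # Stand-in for the real work: turn counts into a probability vector.
    return counts / counts.sum()


# Inner loop: repeated configurations hit the cache instead of recomputing,
# and the draws from np.random.choice are unaffected.
for counts in [(3, 1, 1), (2, 2, 1), (3, 1, 1)]:
    p = cached_cp_slice(counts)
    draw = np.random.choice(len(p), p=p)

# cache_info() reports hits/misses/currsize -- exactly the numbers needed to
# pick a good maxsize and to see how much recomputation the cache avoids.
print(cached_cp_slice.cache_info())
```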

Things to note when we implement this:

  1. No alteration to the test code should be necessary, since this doesn't change the number of calls to the PRNG.
  2. The documentation here (https://docs.python.org/3/library/functools.html) suggests using a cache size that is a power of two - I imagine to avoid resizing or memory-allocation problems (see the rounding sketch after this list).
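If we follow that advice, one way to pick maxsize (an illustrative helper, not anything in the codebase) is to round the expected number of distinct slices up to the next power of two:

```python
def next_power_of_two(n):
    """Round n (>= 1) up to the nearest power of two."""
    return 1 << (n - 1).bit_length()


# Hypothetical example: if profiling suggests roughly 1500 distinct slices
# recur within a sweep, a maxsize of 2048 would cover them.
maxsize = next_power_of_two(1500)  # -> 2048
```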

The general cost of the memoization approach for N iterations of the innermost loop will be (this is my guess, please correct me if I am wrong):

  1. N lookups at O(1) each.
  2. Growth of the dictionary up to its maximum size.
  3. Assignment to the dictionary N - x times, where x is the number of times the result is already in the cache.
  4. Overhead associated with keeping the cache.

The savings will be: x avoided calculations of the full joint probability vector.
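Once a cache is in place, x falls straight out of lru_cache's cache_info(); a rough back-of-the-envelope with made-up numbers:

```python
# Hypothetical counts reported by cache_info() after a profiling run.
hits, misses = 40_000, 10_000
n_calls = hits + misses          # N iterations of the innermost loop
x = hits                         # results that were already in the cache

# If one full joint-probability calculation costs ~t_calc seconds and a
# cache lookup ~t_lookup seconds, the net saving is roughly:
t_calc, t_lookup = 5e-6, 1e-7
saved = x * t_calc - n_calls * t_lookup
print(f"hit rate {x / n_calls:.1%}, estimated net saving {saved:.4f} s")
```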

Copied from original issue: biota/sourcetracker2_internal#2

gregcaporaso commented 8 years ago

From @wdwvt1 on January 5, 2016 17:25

See Justin's work here: https://github.com/biota/general/pull/24

rob-knight commented 8 years ago

That's a great idea!


wdwvt1 commented 7 years ago

#80 eliminates the need for this as far as I can see.

For most of the calls that calculate the joint probability, we likely haven't visited that particular configuration recently (if we have a lot of sources), so the cache would rarely hit. If we need to get faster we can look at this again.