Closed: gregcaporaso closed this issue 7 years ago
From @wdwvt1 on January 5, 2016 17:25
See Justin's work here: https://github.com/biota/general/pull/24
That's a great idea!
On Mar 30, 2016, at 3:52 PM, Greg Caporaso notifications@github.com wrote:
From @wdwvt1 on January 5, 2016 16:59

@justin212k had an excellent suggestion to use memoization for storing calls in some of the inner loops. Based on discussion with @justin212k and @gregcaporaso, we think that memoization should occur on the function `ConditionalProbability.calculate_cp_slice`, since this function is deterministic and merely sets the probabilities that `np.random.choice` draws from.
We can also use the improved `functools.lru_cache` in Python 3 to inspect the number of hits to the cache (via `cache_info()`), which will give us a good estimate of what cache size to use, how much we can save with the cache, etc.
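As a sketch of what that inspection could look like, using a simplified, hypothetical stand-in for `calculate_cp_slice` (the real method's signature and computation differ, and its numpy-array arguments would need to be converted to hashable tuples before they could be cached):

```python
from functools import lru_cache

# Hypothetical stand-in for ConditionalProbability.calculate_cp_slice.
# lru_cache requires hashable arguments, so count vectors are passed as
# tuples here rather than numpy arrays.
@lru_cache(maxsize=1024)
def calculate_cp_slice(sink_counts, source_totals):
    # Toy deterministic computation of a probability vector; the real
    # method combines source/sink count data differently.
    total = sum(sink_counts) + sum(source_totals)
    return tuple((c + s) / total for c, s in zip(sink_counts, source_totals))

# Repeated calls with the same arguments hit the cache.
calculate_cp_slice((3, 1), (10, 20))
calculate_cp_slice((3, 1), (10, 20))
info = calculate_cp_slice.cache_info()
print(info.hits, info.misses)  # 1 1
```

`cache_info()` also reports `maxsize` and `currsize`, so running a few Gibbs iterations and printing it afterward would tell us the hit rate directly.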
Things to note when we implement this:

- Use a cache size of 2^N, I imagine to avoid resizing or memory-allocation problems.
- The general cost of the memoization approach for `N` iterations of the innermost loop will be (this is my guess, please correct me if I am wrong): `N` lookups at O(1); growth of the dictionary up to its maximum size; assignment to the dictionary `N - x` times, where `x` is the number of times the result is already in the cache; and the overhead of maintaining the cache.
- The savings will be: calculation of the full joint probability vector `x` times.
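To make that cost/benefit trade-off concrete, here is a back-of-the-envelope model with made-up constants (real numbers would have to come from profiling):

```python
# Illustrative cost model for memoizing N innermost-loop iterations.
# All constants are invented for the example, not measured.
N = 1_000_000          # iterations of the innermost loop
x = 200_000            # iterations whose result is already cached (hits)
t_lookup = 1e-7        # seconds per O(1) dict lookup
t_insert = 2e-7        # seconds per dict assignment (one per miss)
t_compute = 5e-5       # seconds to compute the full joint probability vector

cost = N * t_lookup + (N - x) * t_insert
savings = x * t_compute
print(f"overhead: {cost:.2f} s, saved: {savings:.2f} s, net: {savings - cost:.2f} s")
```

Under these invented numbers the cache wins comfortably, but the conclusion flips if the hit count `x` is small relative to `N`, which is exactly what `cache_info()` would tell us.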
_Copied from original issue: [biota/sourcetracker2_internal#2](https://github.com/biota/sourcetracker2_internal/issues/2)_
For most of the calls to calculate the joint probability, we likely haven't visited the particular configuration recently (if we have a lot of sources). If we need to get faster we can look at this again.
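A rough simulation of that intuition, using random integers as stand-ins for the argument configurations (the real keys would be the count vectors passed to `calculate_cp_slice`, and the space of distinct configurations grows quickly with the number of sources):

```python
import random

random.seed(0)

def hit_rate(n_configs, n_calls=10_000):
    # Crude stand-in for a cache: an unbounded set of previously seen keys.
    seen, hits = set(), 0
    for _ in range(n_calls):
        key = random.randrange(n_configs)
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / n_calls

print(hit_rate(100))      # small configuration space: high hit rate
print(hit_rate(10**9))    # huge configuration space: hit rate near zero
```

With many sources the configuration space behaves like the second case, which is why the cache may buy little until we profile and confirm otherwise.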