algorithmfoundry / Foundry

The Cognitive Foundry is an open-source Java library for building intelligent systems using machine learning

Random Forests are slowed down by AbstractDataDistribution#getEntropy() #48

Open Zero3 opened 9 years ago

Zero3 commented 9 years ago

I'm testing your new RandomForestFactory and am getting some pretty good results! However, the algorithm is slower than expected.

My code trains hundreds of Random Forest classifiers on a small test dataset. I profiled it, and noticed that ~45% of the time is spent in AbstractDataDistribution#getEntropy(). I suspect that this is not supposed to happen, but if I'm wrong, and this is indeed the natural center of computation, please feel free to close this issue.

I don't know what the underlying performance bottleneck is, but I suspect that the call to MathUtil.log2(double) may be the culprit.

jbasilico commented 9 years ago

I took a quick look. Since this splitting function uses information gain, it ends up calling getEntropy multiple times to evaluate each possible split point. Because this is the innermost loop in the random forests, it's somewhat expected to spend a lot of time there.
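For context, the information-gain criterion being described looks roughly like this (a minimal sketch with hypothetical names, not the actual Foundry code); evaluating it for each candidate split point requires several entropy computations, which is why getEntropy() dominates the profile:

```java
// Sketch of the information-gain criterion: the parent's entropy minus
// the size-weighted entropies of the two children. Hypothetical names;
// not the Foundry implementation.
public final class InformationGainSketch
{
    public static double informationGain(
        final double parentEntropy, final double parentCount,
        final double leftEntropy, final double leftCount,
        final double rightEntropy, final double rightCount)
    {
        return parentEntropy
            - (leftCount / parentCount) * leftEntropy
            - (rightCount / parentCount) * rightEntropy;
    }
}
```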

I did look at log2() and found that it was actually calling Math.log twice, since it was using the general function for computing logarithms with arbitrary bases. There was already a constant for log(2) in LogMath, so I switched the method to use that.
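A minimal sketch of that change, with hypothetical names (the actual fix lives in MathUtil/LogMath):

```java
// Precompute log(2) once instead of recomputing it on every call via
// the general change-of-base formula. Hypothetical class; the real fix
// is in the Foundry's MathUtil/LogMath.
public final class Log2Sketch
{
    /** Natural log of 2, computed once. */
    private static final double LOG_OF_2 = Math.log(2.0);

    /** General form: two Math.log calls per invocation. */
    public static double log2Slow(final double x)
    {
        return Math.log(x) / Math.log(2.0);
    }

    /** Optimized form: one Math.log call per invocation. */
    public static double log2Fast(final double x)
    {
        return Math.log(x) / LOG_OF_2;
    }
}
```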

There may be a way to speed it up by changing the computation to incrementally maintain some of the entropy values, avoiding the need to loop over every category each time the entropy is computed. I'll have to think it through to figure out whether it is likely to work without too many numerical stability issues.
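One way such an incremental computation could look (a sketch under the assumption that a running sum of c·log2(c) stays numerically stable; not the Foundry implementation): using H = -Σ (c_i/N) log2(c_i/N) = log2(N) - S/N with S = Σ c_i log2(c_i), the entropy query becomes O(1) if S and N are updated on every increment.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: maintain the running sum S = sum of c*log2(c)
// so getEntropy() no longer loops over the categories.
public final class IncrementalEntropy<K>
{
    private static final double LOG_OF_2 = Math.log(2.0);

    private final Map<K, Double> counts = new HashMap<>();
    private double total = 0.0;     // N = sum of all counts
    private double sumCLog2C = 0.0; // S = sum over categories of c * log2(c)

    private static double cLog2C(final double c)
    {
        return (c > 0.0) ? c * (Math.log(c) / LOG_OF_2) : 0.0;
    }

    /** Adjust one category's count and update the running sums. */
    public void increment(final K key, final double delta)
    {
        final double oldCount = counts.getOrDefault(key, 0.0);
        final double newCount = oldCount + delta;
        sumCLog2C += cLog2C(newCount) - cLog2C(oldCount);
        total += delta;
        counts.put(key, newCount);
    }

    /** H = log2(N) - S/N, computed without looping over categories. */
    public double getEntropy()
    {
        if (total <= 0.0)
        {
            return 0.0;
        }
        return (Math.log(total) / LOG_OF_2) - (sumCLog2C / total);
    }
}
```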

Zero3 commented 9 years ago

Nice, thanks! I recently tested with a larger dataset that takes several seconds to train on, unlike my previous test datasets, which only took milliseconds to train on. With this dataset, I get the following CPU time distribution:

Method Time
AbstractDataDistribution#getEntropy() ~51%
AbstractVectorThresholdMaximumGainLearner#computeBestGainAndThreshold(Collection, int, DefaultDataDistribution, ArrayList) ~35%
DefaultDataDistribution#increment(Object, double) ~9%
Everything else ~5%

Maybe these numbers will be helpful when you get around to looking at this issue. Note that these numbers are based on the latest release, that is, without the optimization you mentioned above.

(I made several edits to this post to refine my numbers. I also noted that the implementation in Weka scales much better to this larger dataset, training in approximately half the time of Foundry. So I think there is potential for a significant speedup here. For reference, the Weka implementation is here and here.)

Zero3 commented 8 years ago

This is the current status for me as of version 3.4.2:

Method Time
AbstractDataDistribution.getEntropy() ~32%
AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold() ~27%
DefaultDataDistribution.increment() ~18%
AbstractDecisionTreeLearner.splitData() ~7%
Everything else ~16%

Interestingly enough, if I include java.util.HashMap in the profiling (and not just gov.sandia.*), the table looks like this:

Method Time
AbstractDataDistribution.getEntropy() ~32%
AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold() ~28%
HashMap.hash() ~18%
AbstractDecisionTreeLearner.splitData() ~5%
Everything else ~17%

So it would appear that DefaultDataDistribution.increment() spends most of its time doing HashMap operations.
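One plausible reason, sketched here with hypothetical code (not the Foundry implementation): a get-then-put style update hashes the key twice, while Map.merge() locates the entry once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of why HashMap.hash() can dominate a
// count-increment method. Not the Foundry code.
public final class IncrementSketch
{
    public static void main(final String[] args)
    {
        final Map<String, Double> counts = new HashMap<>();

        // Two hash computations per update: one for the lookup, one for the store.
        final double old = counts.getOrDefault("label", 0.0);
        counts.put("label", old + 1.0);

        // One hash computation per update: merge() locates the entry once.
        counts.merge("label", 1.0, Double::sum);

        System.out.println(counts); // {label=2.0}
    }
}
```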

Similarly, if I add java.util.Collections instead, I get the following table:

Method Time
AbstractDataDistribution.getEntropy() ~28%
Collections.sort() ~27%
DefaultDataDistribution.increment() ~19%
AbstractDecisionTreeLearner.splitData() ~7%
Everything else ~19%

So it would appear that AbstractVectorThresholdMaximumGainLearner.computeBestGainAndThreshold() spends most of its time sorting.
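That sorting cost is not surprising if the learner works the way threshold learners typically do (a hypothetical sketch, not Foundry's actual code): candidate thresholds are the midpoints between adjacent distinct feature values, which requires the values in sorted order for every feature evaluated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of why a threshold-split learner calls
// Collections.sort(): enumerating candidate split points along one
// feature requires the values in sorted order.
public final class ThresholdSketch
{
    public static List<Double> candidateThresholds(final List<Double> featureValues)
    {
        final List<Double> sorted = new ArrayList<>(featureValues);
        Collections.sort(sorted); // the cost that shows up in the profile

        final List<Double> thresholds = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++)
        {
            final double prev = sorted.get(i - 1);
            final double next = sorted.get(i);
            if (prev < next)
            {
                thresholds.add((prev + next) / 2.0); // midpoint between distinct values
            }
        }
        return thresholds;
    }

    public static void main(final String[] args)
    {
        System.out.println(candidateThresholds(Arrays.asList(3.0, 1.0, 2.0, 2.0)));
        // [1.5, 2.5]
    }
}
```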