Waikato / moa

MOA is an open source framework for Big Data stream mining. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation.
http://moa.cms.waikato.ac.nz/
GNU General Public License v3.0

classIndex() takes long time with AdaptiveRandomForest #227

Closed Jwata closed 3 years ago

Jwata commented 3 years ago

While profiling the AdaptiveRandomForest learner, I noticed that InstanceImpl.classIndex() takes a long time. I then modified InstanceImpl to compute and cache the class index at initialization. The change is here: https://github.com/Jwata/moa/pull/1.

With the modification, the run was about 4x faster: 53.37s -> 12.77s (CPU time).

This is the task I used.

tree_learner="(
  ARFHoeffdingTree\
  -subspaceSizeSize 60\
  -memoryEstimatePeriod 2000000\
  -gracePeriod 1000\
  -splitConfidence 0.05\
  -tieThreshold 0.0\
  -binarySplits\
  -noPrePrune\
  -leafprediction MC\
)"
learner="(
  meta.AdaptiveRandomForest\
  -numberOfJobs 1\
  -ensembleSize 100\
  -mFeaturesMode (Specified m (integer value))\
  -mFeaturesPerTreeSize 100\
  -driftDetectionMethod (ADWINChangeDetector -a 0.001)\
  -warningDetectionMethod (ADWINChangeDetector -a 0.01)\
  -treeLearner $tree_learner\
)"
task="EvaluatePrequential\
  -stream generators.RandomTreeGenerator\
  -instanceLimit 100000\
  -sampleFrequency 10000\
  -learner $learner"

java -cp `pwd`/moa/target/moa-2020.12.1-SNAPSHOT.jar \
     -javaagent:./lib/sizeofag-1.0.4.jar \
     moa.DoTask $task

But unexpectedly, this change significantly slowed down the HoeffdingTree learner: 22.02s -> 10m31s (CPU time).

This is the task I used.

java -cp `pwd`/moa/target/moa-2020.12.1-SNAPSHOT.orig.jar \
     -javaagent:./lib/sizeofag-1.0.4.jar \
     moa.DoTask "EvaluatePrequential -instanceLimit 10000000 -sampleFrequency 1000000 -learner trees.HoeffdingTree"

Can anybody give some insight into these performance changes?
In any case, I think classIndex() shouldn't affect the overall runtime. How can we mitigate or resolve this issue? (I expected the change I shared to work in all situations, but it didn't...)

bpfa commented 3 years ago

This is very curious. Are the reported runtimes the median of 5 runs, or some other number of runs? Were there other compute jobs running on the same machine? What tool do you use to profile?
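For reference, a minimal sketch of what a median-of-N CPU-time measurement could look like in Java. The class and method names are hypothetical and not part of MOA. It assumes the workload runs on the calling thread and that the JVM supports per-thread CPU time.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

// Hypothetical helper, not part of MOA.
final class MedianTiming {

    // Median of the samples; for an even count, the midpoint of the two middle values.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return sorted[mid - 1] + (sorted[mid] - sorted[mid - 1]) / 2;
    }

    // Runs the workload `runs` times and records the CPU time of the
    // current thread for each run, in nanoseconds.
    static long[] cpuTimeNanos(Runnable workload, int runs) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = threads.getCurrentThreadCpuTime();
            workload.run();
            samples[i] = threads.getCurrentThreadCpuTime() - start;
        }
        return samples;
    }
}
```

Reporting `MedianTiming.median(MedianTiming.cpuTimeNanos(task, 5))` rather than a single run makes an outlier run less likely to dominate the comparison. Work done on other threads is not counted by this sketch.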

Jwata commented 3 years ago

Are the reported runtimes the median of 5 runs, or some other number of runs?

No, they are not medians. But the runtimes I observed across runs were in a similar range.

Were there other compute jobs running on the same machine?

No other running jobs.

What tool do you use to profile?

I used https://github.com/jvm-profiling-tools/async-profiler. It took around 5% of total runtime https://gist.github.com/Jwata/6dfce63d5f1522adee8cb3b23d1bccb3#file-gistfile1-txt-L3258

hmgomes commented 3 years ago

I've tried your change. Some remarks:

abifet commented 3 years ago

Yes, that's right. The problem is that InstanceImpl has several constructors, and the code that initializes classIndex is used in only one of them. RandomTreeGenerator uses a different constructor, InstanceImpl(numberOfAttributes).

The solution could be for all constructors to call a method that sets up the right value for classIndex.
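A minimal sketch of that idea, using hypothetical stand-in classes rather than MOA's real InstanceImpl and header types. It assumes that an instance built from only an attribute count receives its header later, which is why the setter refreshes the cache as well.

```java
// Hypothetical stand-ins; none of these names are MOA's real classes.
final class Header {
    private final int classIndex;

    Header(int classIndex) {
        this.classIndex = classIndex;
    }

    int classIndex() {
        return classIndex;
    }
}

final class CachedInstance {
    private final double[] values;
    private Header header;          // may be attached after construction
    private int cachedClassIndex;

    // Constructor with values and a header.
    CachedInstance(double[] values, Header header) {
        this.values = values.clone();
        this.header = header;
        refreshClassIndex();
    }

    // Constructor with only a size; the header is attached later.
    CachedInstance(int numAttributes) {
        this.values = new double[numAttributes];
        refreshClassIndex();
    }

    // Copy constructor, delegating to the first constructor.
    CachedInstance(CachedInstance other) {
        this(other.values, other.header);
    }

    void setHeader(Header header) {
        this.header = header;
        refreshClassIndex(); // keep the cache in step with the header
    }

    // The single place that decides the cached value (0 when no header is set).
    private void refreshClassIndex() {
        cachedClassIndex = (header == null) ? 0 : header.classIndex();
    }

    int classIndex() {
        return cachedClassIndex;
    }
}
```

Because every constructor and the header setter go through `refreshClassIndex()`, no construction path can leave the cached value uninitialized or stale.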

Jwata commented 3 years ago

@hmgomes, @abifet

Thank you for trying the change on your side, and giving your insights.

As you pointed out, it looks like my change isn't sufficient. But your experiment on a synthetic dataset may indicate that the overhead of classIndex() can be reduced with appropriate changes. I will look into the other constructors as you suggested.

Could you let me know the task you tried with the synthetic dataset so that I can try it?

abifet commented 3 years ago

What about changing only classIndex(), to something like this?

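    /**
     * Cached class index; -1 means it has not been resolved yet.
     */
    protected int classIndex = -1;

    /**
     * Returns the class index, resolving it from the header on the first
     * call and caching it for later calls.
     *
     * @return the index of the class attribute
     */
    @Override
    public int classIndex() {
        if (this.classIndex == -1) {
            this.classIndex = this.instanceHeader.classIndex();
            // Integer.MAX_VALUE is treated as "no class index set": fall back
            // to the start of the range if there is one, otherwise to 0.
            if (this.classIndex == Integer.MAX_VALUE) {
                if (this.instanceHeader.instanceInformation.range != null) {
                    this.classIndex = this.instanceHeader.instanceInformation.range.getStart();
                } else {
                    this.classIndex = 0;
                }
            }
        }
        return this.classIndex;
    }

hmgomes commented 3 years ago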

Hi @Jwata I've used the following:

EvaluatePrequential -l (meta.AdaptiveRandomForest -l (ARFHoeffdingTree -k 10 -e 2000000 -g 50 -c 0.01) -s 10 -m 10 -x (ADWINChangeDetector -a 0.001) -p (ADWINChangeDetector -a 0.01)) -e BasicClassificationPerformanceEvaluator -i 100000

Jwata commented 3 years ago

@abifet @hmgomes

Hi, thank you for your suggestion. I moved the caching logic into classIndex() as @abifet suggested, but there is still the accuracy gap. Looking at the evaluation history, the accuracy is always around 42%, which indicates that it doesn't learn anything from the data.

Will update if I make it work.

abifet commented 3 years ago

On my machine, the change I proposed to you is working. Can you send the link to your commit? Thanks!

Jwata commented 3 years ago

Sorry for my late reply. Here it is. https://github.com/Jwata/moa/pull/1/commits/8c91ccf23934e276535b9688de3cce583d9d0315

abifet commented 3 years ago

this.classIndex = -1; has to be outside InstanceImpl(InstanceImpl inst).
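A minimal sketch of why the placement of the -1 sentinel matters, assuming the commit assigned it only inside the copy constructor (that is a reading of the comment above, not something verified against the commit). The class names are hypothetical and not part of MOA.

```java
// Hypothetical classes for illustration; not MOA code.
class SentinelInOneConstructor {
    protected int classIndex;            // no initializer: defaults to 0
    private final int headerClassIndex;  // stands in for the header lookup

    SentinelInOneConstructor(int headerClassIndex) {
        this.headerClassIndex = headerClassIndex; // sentinel is never set on this path
    }

    SentinelInOneConstructor(SentinelInOneConstructor other) {
        this.headerClassIndex = other.headerClassIndex;
        this.classIndex = -1;            // sentinel set only here
    }

    int classIndex() {
        if (classIndex == -1) {
            classIndex = headerClassIndex; // lazy lookup, cached afterwards
        }
        return classIndex;
    }
}

class SentinelInFieldInitializer {
    protected int classIndex = -1;       // initializer applies to every constructor
    private final int headerClassIndex;

    SentinelInFieldInitializer(int headerClassIndex) {
        this.headerClassIndex = headerClassIndex;
    }

    SentinelInFieldInitializer(SentinelInFieldInitializer other) {
        this.headerClassIndex = other.headerClassIndex;
    }

    int classIndex() {
        if (classIndex == -1) {
            classIndex = headerClassIndex; // lazy lookup, cached afterwards
        }
        return classIndex;
    }
}
```

In the first class, an object built through the int constructor keeps the default value 0, so the lazy lookup never runs and classIndex() always returns 0. A class index that is silently wrong in this way would be consistent with a learner that is fast but does not learn.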

Try using

/**

Jwata commented 3 years ago

It worked indeed. But it didn't improve the runtime speed... Log: https://gist.github.com/Jwata/a97420c33d9b1ae79963930d2c7fbb1f

With my original change, the run was 5x faster, but it didn't learn from the data. With @abifet's change, it learned from the data properly, but it wasn't any faster.

This indicates that my original change was fast only because it incorrectly skipped some code execution. I guess we can conclude that the proposal doesn't work.

Thank you for your help.