aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
206 stars 33 forks source link

Performance regression in 3.5.1 when restoring state #381

Closed z3d1k closed 1 year ago

z3d1k commented 1 year ago

I've noticed significant drop in performance in Java library when restoring from RandomCutForestState.

Benchmark data:

3.5.0
Benchmark                Mode  Cnt    Score   Error  Units
RcfBenchmark.testRestore  avgt   25  106.844 ± 3.632  ms/op

3.5.1
Benchmark                Mode  Cnt     Score     Error  Units
RcfBenchmark.testRestore  avgt   25  1307.392 ± 411.213  ms/op
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class RcfBenchmark {

    public static final RandomCutForestMapper mapper = new RandomCutForestMapper();
    public RandomCutForestState state;
    public RandomCutForest rcf;

    private static float[] generate(int input) {
        return new float[]{(float) (20 * Math.sin(input / 10.0)), (float) (20 * Math.cos(input / 10.0))};
    }

    @Setup(Level.Trial)
    public void setup() {
        mapper.setSaveExecutorContextEnabled(true);
        rcf = RandomCutForest.builder()
                .dimensions(2 * 10)
                .shingleSize(10)
                .sampleSize(628)
                .internalShinglingEnabled(true)
                .build();
        for (int i = 0; i < 100000; i++) {
            rcf.update(generate(i));
        }
        state = mapper.toState(rcf);
    }

    @Benchmark
    public void testRestore(Blackhole bh) {
        bh.consume(mapper.toModel(state));
    }
}

Profiling data shows that most of the time spent on performing .toString operations: flamegraph

It appears that source of the problem is this operation, performing heavy string construction on every iteration:

checkArgument(range > 0, " the union is a single point " + Arrays.toString(point)
                + "or the box is inappropriate, box" + box.toString() + "factor =" + factor);

This can be resolved by either using constant string as error message for checkArgument, or, to preserve detailed error, - perform error message computation lazily if error is detected, e.g.

public static void checkArgument(boolean condition, Supplier<String> messageSupplier) {
    if (!condition) {
        throw new IllegalArgumentException(message.get());
    }
}

checkArgument(range > 0, () -> " the union is a single point " + Arrays.toString(point)
            + "or the box is inappropriate, box" + box.toString() + "factor =" + factor);
sudiptoguha commented 1 year ago

Thanks for the detailed measurements! Indeed we did see a performance hit elsewhere (RCFCast, in forecasting) and we have a PR out https://github.com/aws/random-cut-forest-by-aws/pull/378 for a couple of weeks that fixes the regression. But indeed this solution is better and let us see if we can have this as a 3.6 release.

sudiptoguha commented 1 year ago

Please check out the new version. The example here is in core/src/test/java/com/amazon/randomcutforest/state/RandomCutForestMapperTest.java, see @Test void benchmarkMappers() {

the entire roundtrip (deserialize, evaluate, serialize) seems to be about 36 ms on a commodity Mac with mapper.setSaveTreeStateEnabled(true). The default is false.

z3d1k commented 1 year ago

Thanks! I've tested the new version, the issue is fixed. Thank you for the pointer to setSaveTreeStateEnabled.