aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0

Error when updating tree #374

Closed zuoxiang95 closed 1 year ago

zuoxiang95 commented 1 year ago

Hello, I get an error when updating the tree. Here is my forest configuration:

RandomCutForest.builder()
    .numberOfTrees(150)
    .dimensions(3)
    .sampleSize(256)
    .timeDecay(0.8)
    .outputAfter(256)
    .randomSeed(123)
    .build()

The data point is Array(2.026634, 2.2139728, 0.0), and I get the following exception:

java.lang.IllegalStateException: The break point did not lie inside the expected range
    at com.amazon.randomcutforest.tree.RandomCutTree.randomCut(RandomCutTree.java:172)
    at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:227)
    at com.amazon.randomcutforest.tree.RandomCutTree.addPoint(RandomCutTree.java:51)
    at com.amazon.randomcutforest.executor.SamplerPlusTree.update(SamplerPlusTree.java:92)
    at com.amazon.randomcutforest.executor.SequentialForestUpdateExecutor.lambda$update$0(SequentialForestUpdateExecutor.java:39)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at com.amazon.randomcutforest.executor.SequentialForestUpdateExecutor.update(SequentialForestUpdateExecutor.java:40)
    at com.amazon.randomcutforest.executor.AbstractForestUpdateExecutor.update(AbstractForestUpdateExecutor.java:76)
    at com.amazon.randomcutforest.executor.AbstractForestUpdateExecutor.update(AbstractForestUpdateExecutor.java:69)
    at com.amazon.randomcutforest.RandomCutForest.update(RandomCutForest.java:601)
    at com.jd.analysis.RRCFDetect.processElement(JimdbClientTp.scala:146)
    at com.jd.analysis.RRCFDetect.processElement(JimdbClientTp.scala:118)
    at org.apache.flink.streaming.api.operators.KeyedProcessOperator.processElement(KeyedProcessOperator.java:83)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask$StreamTaskNetworkOutput.emitRecord(OneInputStreamTask.java:233)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.processElement(AbstractStreamTaskNetworkInput.java:134)
    at org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:105)
    at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:495)
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:806)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:758)
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958)
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575)
    at java.lang.Thread.run(Thread.java:748)

sudiptoguha commented 1 year ago

Thanks for letting us know. The failure means that the random cut fell outside the bounding box. I am guessing there was some intervening data before this point? The following test on its own did not produce the error (because it does not invoke line 172 in RandomCutTree). Given that the seed is fixed, the error should be reproducible once the intervening data is included.

(On an orthogonal note, 150 trees may be more than you need, given that the data is only 3-dimensional.)

@Test
public void Issue374() {
    RandomCutForest forest = RandomCutForest.builder()
            .numberOfTrees(150)
            .dimensions(3)
            .sampleSize(256)
            .timeDecay(0.8)
            .outputAfter(256)
            .randomSeed(123)
            .build();
    forest.update(new float[] {2.026634f, 2.2139728f, 0.0f});
}
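
To make "the random cut was outside the bounding box" concrete, here is a minimal standalone sketch of how a random cut is typically drawn from a bounding box, together with the range check that the exception message describes. It is not the code in RandomCutTree; the names RandomCutSketch, Cut, and randomCut are hypothetical, and the idea that float rounding is one way such a check can trip is an assumption that this thread does not confirm.

```java
import java.util.Random;

// Illustrative sketch only; not the library's RandomCutTree implementation.
public class RandomCutSketch {

    /** A cut: the dimension it splits and the coordinate of the split. */
    public static final class Cut {
        public final int dimension;
        public final float value;

        Cut(int dimension, float value) {
            this.dimension = dimension;
            this.value = value;
        }
    }

    /**
     * Picks a dimension with probability proportional to its side length,
     * then a cut position inside [min[d], max[d]) on that dimension.
     */
    public static Cut randomCut(Random random, float[] min, float[] max) {
        double totalRange = 0.0;
        for (int d = 0; d < min.length; d++) {
            totalRange += (double) max[d] - min[d];
        }
        if (totalRange <= 0.0) {
            throw new IllegalArgumentException("bounding box has zero length on every side");
        }

        // A position along the concatenated side lengths of the box.
        double breakPoint = random.nextDouble() * totalRange;
        for (int d = 0; d < min.length; d++) {
            double range = (double) max[d] - min[d];
            if (breakPoint < range) {
                float value = (float) (min[d] + breakPoint);
                // Narrowing to float can round the cut up onto the upper face
                // of the box; step back inside instead of trusting exact arithmetic.
                if (value >= max[d]) {
                    value = Math.nextDown(max[d]);
                }
                if (value < min[d] || value >= max[d]) {
                    throw new IllegalStateException("cut lies outside the bounding box");
                }
                return new Cut(d, value);
            }
            breakPoint -= range; // zero-length sides are skipped here
        }
        // Reached only if accumulated rounding left breakPoint past the last side.
        throw new IllegalStateException("The break point did not lie inside the expected range");
    }
}
```

For a box spanned by the origin and the reported point, the third side has zero length, so only dimensions 0 and 1 can be chosen and the cut always lands strictly inside the chosen side.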

sudiptoguha commented 1 year ago

Hopefully PR 376 will resolve this specific issue.

zuoxiang95 commented 1 year ago

@sudiptoguha Thanks, I will try this version.

zuoxiang95 commented 1 year ago

Hello @sudiptoguha, I can't find the new version "3.5.1" jar package in the Maven repository. Could you please publish the new version?

amitgalitz commented 1 year ago

Hi @zuoxiang95, 3.5.1 should be posted to Maven now.

sudiptoguha commented 1 year ago

Thanks Amit!

zuoxiang95 commented 1 year ago

Thanks, guys!