Open ylwu-amzn opened 2 years ago
Great discussion setup! There are two issues in your question:
For the first question re non-time-series data -- the first thing to observe is that "non-time-series data" is an ambiguous term. Let's look at a finer grouping.
1a. Suppose that you expect the same/similar result if you permute the data. In this case, the fact that permutation is meaningful implies that time/sequence order is not relevant. One topic of interest in this context is https://en.wikipedia.org/wiki/Exchangeable_random_variables. Here shingling does not (should not) make sense, so one can use time-decay = 0 (turning off recency-biased sampling) and shingle size = 1. For RCF (and only in this case 1a -- if this restriction is not followed there can be severely problematic consequences) one can call update() to build a model and then score all the points. If data movement is a concern (and it should be in the long run) we could discuss (and maybe enable in RCF 3.0) bootstrap sampling: draw 4 * sampleSize points and create multiple (conditionally) independent subsamples to build the trees. Linearity of expectation would guarantee that the bootstrapped scores are the same as the non-bootstrapped scores in expectation (but bootstrap systems would be noisier: some scores/inferences will work better, some worse). I would strongly recommend against partitioning the samples, because that correlation cannot be analyzed easily. Instead of materializing all the data at a single point of ingestion, one can send the 400KB model out to the nodes -- even to 200 nodes, that's 80MB total. If one performs all updates followed by all scores, then whenever the data exceeds 80MB there is more data movement than shipping the model.
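The bootstrap idea above can be sketched as follows. This is a minimal illustration, not the RCF implementation: `score_against` is a hypothetical stand-in scorer (distance to the subsample mean) used only to show the "4 * sampleSize draws, k independent subsamples, average the scores" pattern and the linearity-of-expectation argument.

```python
import random

def bootstrap_subsamples(data, sample_size, k=4, seed=0):
    """Draw k * sample_size points with replacement and split them into
    k conditionally independent subsamples (k=4 matches the suggestion above)."""
    rng = random.Random(seed)
    draws = [rng.choice(data) for _ in range(k * sample_size)]
    return [draws[i * sample_size:(i + 1) * sample_size] for i in range(k)]

def score_against(subsample, point):
    # Stand-in scorer (NOT the RCF algorithm): distance of the point from
    # the subsample mean, purely to make the averaging below concrete.
    dim = len(point)
    mean = [sum(p[d] for p in subsample) / len(subsample) for d in range(dim)]
    return sum((point[d] - mean[d]) ** 2 for d in range(dim)) ** 0.5

def bootstrap_score(subsamples, point):
    # Average over the independent subsamples; by linearity of expectation
    # the mean score matches a single-sample score in expectation,
    # at the cost of extra variance (noise) per individual score.
    return sum(score_against(s, point) for s in subsamples) / len(subsamples)
```

Partitioning the data instead of bootstrapping would correlate the subsamples in ways that are hard to analyze, which is why the text recommends against it.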
1b. Suppose the sequentiality of the data matters -- in other words there is some ordering; think past, present, future. In this case "update first, score later" would mean that the past is scored based on the future ... which is not always a great setup. Here we should use RCF/TRCF as it is used most commonly: score, followed by update, on each point, with the hyperparameter shingleSize set to a desirable value. This use case corresponds to event-driven systems -- take a look at https://en.wikipedia.org/wiki/Markov_chain (the Variations section); shingleSize is/was intended to capture the order of the Markov chain.
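The score-then-update discipline can be sketched generically. `score_fn` and `update_fn` are placeholders for the model's scoring and update calls (not the actual RCF API); the point is the ordering -- each shingle is scored before it is allowed to influence the model, so the past is never scored by the future:

```python
from collections import deque

def stream_scores(values, shingle_size, score_fn, update_fn):
    """Score each point BEFORE updating the model, using a sliding
    shingle (window) of the most recent shingle_size values."""
    window = deque(maxlen=shingle_size)
    scores = []
    for v in values:
        window.append(v)
        if len(window) < shingle_size:
            continue  # wait until the first full shingle is available
        shingle = list(window)
        scores.append(score_fn(shingle))  # score first ...
        update_fn(shingle)                # ... then update
    return scores
```

A larger `shingle_size` corresponds to conditioning on more history, i.e. a higher-order Markov chain, as the Wikipedia link above describes.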
1c. Standard metric time series -- we get one value for each time interval. This is covered by 1b.
1d. We get multiple values in each time interval. This corresponds to a dynamic, time-dependent version of 1a. Again shingleSize should be 1 -- any other value does not make obvious sense (exception: some folks may want to perform sample-path computation).
For the second question -- what you're suggesting is most similar to P(recision)@N measures: find the top N scores and ask whether they are relevant (again, this is not well defined in the presence of time). I would suggest looking at the top 10, with an option for folks to filter out scores in a range -- by absolute value, or by ratios of the data. Why range-based filtering? Often the obvious anomalies are literally obvious, and folks may simply not care about them. There are two core issues:

2a. If the data is ever altered, say for cold start/initial values, then you may only see the results of your own operation. There is a strong caution about automating these types of processes; it can go really bad.

2b. If there are precision labels (and the data is not a time series), then is this a classification task masquerading as anomaly detection? Sometimes people test anomaly detection by checking whether it finds minority classes. I would not recommend this without knowing the final end goal -- the reason is that folks often have a simplification in mind such as anomaly = not(normal), while in reality the not() function depends on the data and its use.
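The P@N idea with range-based filtering can be sketched as follows. The function names and the `min_score`/`max_score` parameters are illustrative, not part of any library; the filter implements the "drop the literally obvious anomalies" option mentioned above:

```python
def top_n_anomalies(scores, n=10, min_score=None, max_score=None):
    """Return the indices of the top-n scores, optionally restricting
    attention to a score range first (e.g. capping max_score to drop
    anomalies so obvious that nobody needs an alert for them)."""
    candidates = [
        (s, i) for i, s in enumerate(scores)
        if (min_score is None or s >= min_score)
        and (max_score is None or s <= max_score)
    ]
    candidates.sort(reverse=True)
    return [i for _, i in candidates[:n]]

def precision_at_n(top_indices, relevant):
    # Fraction of the reported top-n that are truly relevant anomalies.
    if not top_indices:
        return 0.0
    return sum(1 for i in top_indices if i in relevant) / len(top_indices)
```

Note that with time-ordered data this measure remains ill-defined, as the text cautions, because relevance itself can depend on when a point was scored.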
We are using TRCF in the AD plugin to detect anomalies in streaming time-series data. Now we are going to integrate the RCF lib into MLCommons. Just curious whether we can use the current RCF/TRCF directly to detect anomalies in non-time-series data. RCF just outputs scores, and in my experiments on different data sets the anomaly score range differs a lot. For example, an anomaly score threshold of 1.0 works for some data sets, but brings too many false positives on others, where maybe 3.0 would be good. So maybe we should find some way to automate the anomaly score threshold selection. A naive idea: set an anomaly ratio like 0.001, then sort the scores and take the top 0.1% of points with the highest scores.
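The naive ratio-based idea above amounts to picking the (1 - ratio) quantile of the observed scores as the threshold. A minimal sketch (function name and clamping behavior are my own, not from any library):

```python
def quantile_threshold(scores, anomaly_ratio=0.001):
    """Pick a score threshold so that roughly anomaly_ratio of the
    points are flagged: sort the scores and take the value at the
    (1 - anomaly_ratio) quantile position."""
    ordered = sorted(scores)
    cut = int(len(ordered) * (1 - anomaly_ratio))
    cut = min(max(cut, 0), len(ordered) - 1)  # clamp for tiny inputs
    return ordered[cut]
```

With an anomaly ratio of 0.001, points scoring at or above the returned threshold form roughly the top 0.1%. The reply's caution in 2a applies here: if such a threshold is recomputed automatically on data the detector has already altered (e.g. cold-start values), the system may end up reacting to its own output.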