dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

DetectIidSpike not finding spikes ... #5981

Open FlukeFan opened 3 years ago

FlukeFan commented 3 years ago

I forked the machinelearning-samples repository. I generated a new product-sales.csv with 250 rows (instead of 36), and random values between 20K-30K.

There is one row in the file (at line 53) with an obvious spike of 60K.

Running the program does not pick up the spike.

To Reproduce Using the machinelearning-samples code:

  1. Replace the product-sales.csv with the one above: https://github.com/FlukeFan/machinelearning-samples/blob/main/samples/csharp/end-to-end-apps/AnomalyDetection-Sales/SpikeDetectionE2EApp/Data/product-sales.csv#L53

  2. Set the number of entries in the file to 250: https://github.com/FlukeFan/machinelearning-samples/blob/main/samples/csharp/end-to-end-apps/AnomalyDetection-Sales/SpikeDetectionE2EApp/SpikeDetection.ModelTrainer/Program.cs#L29

  3. Run the program - no spike is detected.

Expected behavior I would expect the DetectIidSpike transform to identify the spike.

Screenshots, Code, Sample Projects The code I used is a fork here: https://github.com/FlukeFan/machinelearning-samples/tree/main/samples/csharp/end-to-end-apps/AnomalyDetection-Sales

There was a single commit to generate the test data: https://github.com/FlukeFan/machinelearning-samples/commit/d838f17b16d5cce71397416a9481711e6089c600

Additional context I notice that reducing the pvalueHistoryLength to about 50 highlights the spike, however I would expect larger values of the sliding window to also pick up the spike.

I had originally observed this problem when trying the spike detection on a much larger file (many thousands of records), but I have reduced it to the 250-line file that demonstrates the problem.

michaelgsharp commented 3 years ago

Hey @FlukeFan, I have looked into this and figured out what's going on and I want your thoughts on it.

Currently, this is "by design". What I mean by that is that our code doesn't alert on anything until it has loaded a number of rows equal to the pvalueHistoryLength you have chosen. Since we are streaming in rows one at a time, this prevents alerts on data that is completely fine but looks like a "spike" only because so few rows have been read so far. This is why your sample data does catch the spike when the pvalueHistoryLength is 52, but not when it's 53.
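The warm-up behavior described above can be modeled with a small sketch (illustrative Python only, not ML.NET's actual implementation; the z-score test and threshold stand in for the real p-value computation):

```python
from collections import deque
import random
import statistics

def detect_spikes(values, history_length, threshold=4.0):
    """Toy streaming spike detector: alerts are suppressed until
    history_length rows have been seen, mirroring the reported behavior."""
    window = deque(maxlen=history_length)
    alerts = []
    for i, x in enumerate(values):
        if len(window) == history_length:  # warm-up complete
            mean = statistics.fmean(window)
            std = statistics.pstdev(window)
            if std > 0 and abs(x - mean) > threshold * std:
                alerts.append(i)
        window.append(x)
    return alerts

# 250 random values between 20K and 30K, with a spike at index 52
# (line 53 of the CSV, counting the header row).
random.seed(0)
data = [random.uniform(20_000, 30_000) for _ in range(250)]
data[52] = 60_000.0

print(detect_spikes(data, 52))  # [52] - window fills just in time
print(detect_spikes(data, 53))  # []   - still warming up at the spike
```

This reproduces the boundary exactly: with a history length of 52 the window is full when the spike arrives, with 53 the spike falls inside the warm-up period and is never tested.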

So this is what I want your thoughts on. Do you agree with that choice? Why/why not? Is this something you would want the ability to change (e.g. spike on an anomaly before you have read pvalueHistoryLength number of rows)? Or would you rather have it read in pvalueHistoryLength number of rows before it starts doing any calculations? (This would mean we would have to buffer pvalueHistoryLength rows in memory, which could cause memory issues if set too high.)

I think the final one makes the most sense to me, but it's also the most complex to handle.

FlukeFan commented 3 years ago

Hi @michaelgsharp, thanks for looking into this.

> Do you agree with that choice?

If I were implementing it, I would have done the same thing. However, as a user of the library, the first thing I would try (to test it) is to put an obvious anomaly on the first line of a test file - so while this might not occur nearly as often in 'real' data, it's a surprising result for a first-time user of the library (and I suspect many others would walk away at that point rather than raise/ask about it?).

> (like spike on an anomaly before you have read pvalueHistoryLength number of rows)

I suspect that would run the risk of having too many false-positives (for the reason you said; 'to prevent alerts on data that is completely fine, but because of the lack of rows currently read they are a "spike"').

In the first instance, perhaps just documenting it is enough? That gives the user of the library an opportunity to think about how they want it handled.

The current example uses pvalueHistoryLength: size / 4 - it is not clear from the example that this means any anomalies in the first quarter of the file will not be found (which is, indeed, the problem I saw). I'm not sure what the guidance should be, but size / 4 is unlikely to be useful for a lot of users.

> this would mean we would have to buffer in memory pvalueHistoryLength number of rows

Having thought about this a little: if I had my pvalueHistoryLength set to (say) 200 (and I'm not sure how I would go about finding a 'good' value for this either), I might just duplicate the first 200 rows myself (which, in effect, pushes that buffer onto the user of the library to maintain). If the input stream can be re-read from the beginning, the library could perhaps offer that as an option: when it is turned on, it reads the first pvalueHistoryLength rows, then resets the stream to the beginning and starts again with spike detection on.
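The duplicate-the-first-rows workaround can be sketched in the same toy setting (again illustrative Python, not ML.NET; detect_with_priming is a hypothetical helper, and the z-score check stands in for the real p-value test):

```python
from collections import deque
import random
import statistics

def detect_with_priming(values, history_length, threshold=4.0):
    """Prepend a copy of the first history_length rows so the warm-up
    period consumes only duplicated data, then map alert indices back."""
    primed = list(values[:history_length]) + list(values)
    window = deque(maxlen=history_length)
    alerts = []
    for i, x in enumerate(primed):
        if len(window) == history_length:  # warm-up complete
            mean = statistics.fmean(window)
            std = statistics.pstdev(window)
            if std > 0 and abs(x - mean) > threshold * std:
                alerts.append(i - history_length)  # index into original data
        window.append(x)
    return [i for i in alerts if i >= 0]  # drop alerts inside the prefix

random.seed(0)
data = [random.uniform(20_000, 30_000) for _ in range(250)]
data[52] = 60_000.0

# Even with a window much longer than 52 rows, the spike is now caught.
print(detect_with_priming(data, 100))  # [52]
```

The cost is exactly the trade-off discussed above: the first pvalueHistoryLength rows are read twice, which only works when the input can be buffered or re-read from the start.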

FlukeFan commented 3 years ago

@michaelgsharp I also wonder if there should be a second parameter - trainingLength (>= 1) - which specifies the initial number of records used for 'training' the model. This way, it's less of a surprise that the first n records do not have spikes detected on them. Setting this to 1 would be the same as the option you describe to 'spike on an anomaly before you have read pvalueHistoryLength number of rows', except that it wouldn't (couldn't?) ever spike on the first record.
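The trainingLength idea can be sketched the same way (illustrative Python only; the parameter name and the z-score check are assumptions, not ML.NET API):

```python
from collections import deque
import random
import statistics

def detect_with_training_length(values, history_length, training_length,
                                threshold=4.0):
    """Toy detector with a separate training_length (>= 1): alerts are
    suppressed only for the first training_length rows, even while the
    sliding window is still filling up."""
    assert training_length >= 1
    window = deque(maxlen=history_length)
    alerts = []
    for i, x in enumerate(values):
        # Need at least two prior rows for a meaningful spread estimate.
        if i >= training_length and len(window) >= 2:
            mean = statistics.fmean(window)
            std = statistics.pstdev(window)
            if std > 0 and abs(x - mean) > threshold * std:
                alerts.append(i)
        window.append(x)
    return alerts

random.seed(0)
data = [random.uniform(20_000, 30_000) for _ in range(250)]
data[52] = 60_000.0

# With the sample's history_length of size / 4 = 62, a short training
# period is enough for the spike at row 52 to be caught.
print(52 in detect_with_training_length(data, 62, training_length=10))  # True
```

With training_length=1 the detector can alert from the second row onward but, as noted, never on the first record, since there is no prior data to compare it against.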