RubixML / ML

A high-level machine learning and deep learning library for the PHP language.
https://rubixml.com
MIT License

Determine the range of expected values #40

Closed ilvalerione closed 4 years ago

ilvalerione commented 4 years ago

Hi Andrew, I'm really impressed by this project; it's exactly what the PHP ecosystem needs right now.

I would like to use this project in my application to determine the range of expected values in time-series data. My application traces how long an HTTP request takes to complete (its duration), which I represent in my app with a simple line chart.

Durations tend to have pronounced peaks and valleys, depending on the time of day or the day of the week. Those fluctuations make it very hard to set a simple threshold for alerting purposes.

Based on historical data I would calculate an area range that shows the band within which the data could be considered normal, like the example below: [area-range chart]

In this way I could have visual feedback about how likely it is that the current values are normal, based on the metric's historical behavior.

Studying the documentation, Logistic Regression caught my attention because it can also be partially trained, so I can evolve its model as new data is acquired.

My question: using this kind of algorithm, for any point in my data I will get a numeric value that represents the prediction of whether the point is an anomaly or not.

Is there a way ML can help me to find a range of correctness for any point based on historical behavior as shown in the image above?

Are there some resources to put me in the right direction?

andrewdalpino commented 4 years ago

Hi @ilvalerione thanks for the interesting question

Let me start by making sure that our understandings are consistent

Logistic Regression is a type of Online classifier whose prediction is a class label such as 'cat', 'dog', etc. It can also output a probability distribution over these classes as it implements the Probabilistic interface. Is this what you mean by 'range of correctness?'
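To make that concrete, here is a minimal sketch of what I mean by a probability distribution, assuming default hyper-parameters and made-up samples and labels:

use Rubix\ML\Classifiers\LogisticRegression;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;

// Made-up training samples with continuous features and two class labels.
$samples = [[1.2, 3.4], [0.9, 2.8], [5.6, 7.1], [6.0, 8.2]];
$labels = ['cat', 'cat', 'dog', 'dog'];

$estimator = new LogisticRegression();

$estimator->train(new Labeled($samples, $labels));

// predict() returns class labels, while proba() returns a probability
// distribution over the classes for each sample,
// e.g. ['cat' => 0.92, 'dog' => 0.08].
$probabilities = $estimator->proba(new Unlabeled([[1.0, 3.0]]));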

As a side effect, Logistic Regression can also be used as a supervised anomaly detector where the class labels are 'anomaly', 'not anomaly.' Is this how you plan to use the estimator? As opposed to an unsupervised online anomaly detector such as Gaussian MLE?
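For reference, a Gaussian MLE workflow would look roughly like the sketch below, again with made-up samples and default hyper-parameters:

use Rubix\ML\AnomalyDetectors\GaussianMLE;
use Rubix\ML\Datasets\Unlabeled;

// Unlabeled historical samples with continuous features,
// e.g. [duration, memory peak].
$dataset = new Unlabeled([
    [12.1, 4.2],
    [20.0, 6.7],
    [68.35, 12.0],
]);

$estimator = new GaussianMLE();

$estimator->train($dataset);

// Gaussian MLE implements Online, so the model can be updated as new data arrives.
$estimator->partial(new Unlabeled([[15.3, 5.1]]));

// predict() returns 1 for samples flagged as anomalies and 0 otherwise.
$predictions = $estimator->predict(new Unlabeled([[250.0, 30.0]]));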

As others have recently inquired in issues https://github.com/RubixML/RubixML/issues/38 and https://github.com/RubixML/RubixML/issues/35, your problem similarly involves non-stationary time series data, which Rubix ML does not currently support. There are models, for example ARIMA, that can handle non-stationary time series natively, and, given the recent interest, I am currently looking into how models like these would fit into the Rubix ML architecture. As such, we may end up implementing time series support in the near future.

ilvalerione commented 4 years ago

Hi @andrewdalpino, thank you for your message. Reading your documentation helped me better understand my problem, and I appreciate your learning-oriented content.

I should emphasize that I'm thinking from a developer's point of view, so many details may be beyond my skills.

I was thinking of Logistic Regression classifying data by "hour of the day" and "day of the week":

$transactions = [
    // [duration, memory_peak, hour_of_day, day_of_week],

    [12.1, 4.2, 10, 'Saturday'],
    [20.0, 6.7, 11, 'Saturday'],
    [68.35, 12.0, 11, 'Thursday'],
];

In this way I'm trying to correlate duration and memory_peak, but linking this classification to the hour and day of the week is equivalent to assuming that the data is weekly seasonal. I thought that using an online detector could mitigate the seasonal assumption by changing the model over time.

You wrote: "It can also output a probability distribution over these classes as it implements the Probabilistic interface. Is this what you mean by 'range of correctness?'"

Yes, I was thinking of using this information to build the "dynamic grey band" in the chart.

I'm not sure that classifiers are the right choice for this scenario because, in the end, I thought I'm dealing with an "unsupervised dataset". I'm not able to know which samples in the past are anomalies and train the model accordingly. I'm thinking that the ability to tell whether a sample is an anomaly or not should be acquired by the algorithm itself, based on the historical dataset.

Thanks to your advice I better understand Gaussian MLE; it could be another reasonable approach.

I hear more and more often about algorithms like ARIMA or SARIMA (the S stands for seasonal).

I'm a developer trying to implement better solutions to problems. This is a completely new world for me, so thank you for the information.

andrewdalpino commented 4 years ago

@ilvalerione We're glad you've chosen Rubix ML to learn - feel free to ask questions, and welcome to our community

I don't quite understand your objective; help me understand.

What is the target variable that you are trying to predict? Duration and Memory Peak?

If so, your labels will be either the duration or the memory peak (not both, since we don't support multi-label regressors yet). Since those variables are continuous in nature, you'll need a regressor to predict the value of duration or memory peak given some input features - such as the hour of the day and the day of the week (using your example). See the section of the docs on inference for more info. Note that despite having regression in the name, Logistic Regression is a classifier.

Since you have a categorical feature, 'day of week', in your dataset, you'll need a regressor that is compatible with both categorical and continuous features. For your case, I would recommend a Regression Tree because it is simple, fast, and explainable. Another option is Gradient Boost, which has a tutorial, but it may be overkill for your dataset.
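A rough sketch of what that could look like with Regression Tree, using made-up samples adapted from your example, default hyper-parameters, and duration as the label:

use Rubix\ML\Regressors\RegressionTree;
use Rubix\ML\Datasets\Labeled;
use Rubix\ML\Datasets\Unlabeled;

// Features: [memory peak, hour of day, day of week]; label: duration.
$samples = [
    [4.2, 10, 'Saturday'],
    [6.7, 11, 'Saturday'],
    [12.0, 11, 'Thursday'],
];

$labels = [12.1, 20.0, 68.35];

$estimator = new RegressionTree();

$estimator->train(new Labeled($samples, $labels));

// Predict the expected duration for a new sample.
$durations = $estimator->predict(new Unlabeled([[5.0, 14, 'Monday']]));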

Unfortunately, neither of those learners can be partially trained - however, you can transform your categorical features into continuous ones using One Hot Encoder and then use Adaline.
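Roughly, and assuming the same made-up samples as above with default hyper-parameters:

use Rubix\ML\Regressors\Adaline;
use Rubix\ML\Transformers\OneHotEncoder;
use Rubix\ML\Datasets\Labeled;

$samples = [[4.2, 10, 'Saturday'], [6.7, 11, 'Saturday'], [12.0, 11, 'Thursday']];
$labels = [12.1, 20.0, 68.35];

$dataset = new Labeled($samples, $labels);

// One Hot Encoder converts the categorical day-of-week column into
// binary continuous features that Adaline can handle.
$dataset->apply(new OneHotEncoder());

$estimator = new Adaline();

$estimator->train($dataset);

// Adaline implements Online, so it can be partially trained later with
// new batches that have been encoded the same way.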

One last option is to use KNN Regressor with the Gower distance kernel (since it is compatible with both categorical and continuous data types). KNN has the added benefit of implementing the Online interface; however, it can become computationally intractable with large training sets.
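Something along these lines, again with made-up samples and illustrative hyper-parameters:

use Rubix\ML\Regressors\KNNRegressor;
use Rubix\ML\Kernels\Distance\Gower;
use Rubix\ML\Datasets\Labeled;

$samples = [[4.2, 10, 'Saturday'], [6.7, 11, 'Saturday'], [12.0, 11, 'Thursday']];
$labels = [12.1, 20.0, 68.35];

// Gower handles a mix of categorical and continuous features, and
// KNN Regressor implements Online so it can be partially trained.
$estimator = new KNNRegressor(3, true, new Gower());

$estimator->train(new Labeled($samples, $labels));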

You can obtain a 'confidence interval' - or, to use your words, a 'range of expected values' - through cross-validation, in which the model is tested on unseen data. A report such as Residual Analysis will give you error metrics such as MAE (mean absolute error); an MAE of 10 means that each prediction is off by about +/- 10 on average.
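A sketch of that workflow with a simple holdout split, using the made-up samples from above (depending on your version, the report class may be named Residual Analysis or Error Analysis):

use Rubix\ML\Regressors\RegressionTree;
use Rubix\ML\CrossValidation\Reports\ResidualAnalysis;
use Rubix\ML\Datasets\Labeled;

$samples = [[4.2, 10, 'Saturday'], [6.7, 11, 'Saturday'], [12.0, 11, 'Thursday']];
$labels = [12.1, 20.0, 68.35];

// Hold out a portion of the data for testing.
[$training, $testing] = (new Labeled($samples, $labels))->randomize()->split(0.8);

$estimator = new RegressionTree();

$estimator->train($training);

$predictions = $estimator->predict($testing);

// The report includes error metrics such as mean absolute error (MAE).
$report = new ResidualAnalysis();

$results = $report->generate($predictions, $testing->labels());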

Could it be that what you are really looking for is a way to forecast this time series so that you can predict the next k time steps starting from an initial timestamp? If so, you'll have to wait for time series support.