SkyAPM / aiops-engine-for-skywalking

This is an incubating repository of the Apache SkyWalking AIOps Engine
https://github.com/apache/skywalking/discussions/8883
Apache License 2.0

[Algorithm] Implement real-time anomaly detection for metrics #7

Open Fengrui-Liu opened 2 years ago

Fengrui-Liu commented 2 years ago

The goals of our project are:

Superskyyy commented 2 years ago

Thanks! For the algorithms, we can directly install the package or add it as a git submodule from your repository.

Fengrui-Liu commented 2 years ago

So far the algorithm has not been evaluated yet, but I personally prefer SPOT. It clearly shows dynamic upper and lower bounds (some commercial products have this feature, like Datadog), which is more user-friendly than the others. Of course, Prophet is also an option.
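To illustrate what such dynamic bounds could look like on a stream, here is a minimal sketch. It is a simplified rolling mean/std band, not the actual SPOT/EVT procedure, and all names in it are hypothetical:

```python
from collections import deque


class RollingBoundDetector:
    """Hypothetical simplified dynamic-bound detector; NOT the real SPOT/EVT procedure."""

    def __init__(self, window: int = 200, k: float = 3.0):
        self.history = deque(maxlen=window)  # recent observations
        self.k = k                           # band width in standard deviations

    def fit_score(self, x: float):
        """Update the bounds with x and report whether it falls outside them."""
        if len(self.history) < 10:           # warm-up: not enough history yet
            self.history.append(x)
            return False, x, x
        mean = sum(self.history) / len(self.history)
        std = (sum((v - mean) ** 2 for v in self.history) / len(self.history)) ** 0.5
        lower, upper = mean - self.k * std, mean + self.k * std
        is_anomaly = x < lower or x > upper
        self.history.append(x)
        return is_anomaly, lower, upper
```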

Superskyyy commented 2 years ago

That is great, let's keep this preference in mind and test. I will provide you with a gRPC implementation for exporting metrics if needed, but for very early testing purposes, a simple generator function will suffice to mock a stream.
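For early testing, the mock stream could be as simple as a generator like the following (a sketch only; the field names and metric shape are assumptions, not the actual exporter format):

```python
import math
import random
import time
from typing import Iterator


def mock_metric_stream(n: int = 1000, anomaly_rate: float = 0.01) -> Iterator[dict]:
    """Yield synthetic metric points: a sine-wave baseline plus noise and rare injected spikes."""
    for i in range(n):
        value = 100 + 20 * math.sin(i / 50) + random.gauss(0, 2)
        if random.random() < anomaly_rate:
            value += random.choice([-1, 1]) * random.uniform(30, 60)  # injected anomaly
        yield {"service": "demo-service", "metric": "cpu_usage", "ts": time.time(), "value": value}


for point in mock_metric_stream(5):
    print(point)
```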

Fengrui-Liu commented 2 years ago

Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I'm trying to introduce AutoML-related techniques to automatically select the best among different detectors.
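A rough sketch of what such benchmark-driven selection could look like; the detector factory, labeled stream, and scoring rule here are placeholders for illustration, not an actual evaluation pipeline:

```python
from typing import Callable, Iterable, Tuple


def evaluate(detector, labeled_stream: Iterable[Tuple[float, int]]) -> float:
    """Replay one labeled benchmark stream through a detector and return a simple F1 score."""
    tp = fp = fn = 0
    for value, label in labeled_stream:
        predicted = detector.fit_score(value)  # assumed incremental API returning True/False
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif label:
            fn += 1
    return 2 * tp / max(2 * tp + fp + fn, 1)


def select_best(candidates: dict, stream_factory: Callable) -> str:
    """Benchmark every candidate detector on the same data and keep the best-scoring one."""
    scores = {name: evaluate(make(), stream_factory()) for name, make in candidates.items()}
    return max(scores, key=scores.get)
```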

Superskyyy commented 2 years ago

> Recently, I have been working on the benchmark. I realize that a single detector may not generalize well enough, so I'm trying to introduce AutoML-related techniques to automatically select the best among different detectors.

Many commercial vendors use such techniques to provide reliable results; I believe it's the right direction to go. Good luck, and keep me updated so we can collaborate.

Superskyyy commented 2 years ago

@Fengrui-Liu Are the metric algorithms trained incrementally online, or retrained offline periodically? (I checked the SPOT paper and they say both are doable, but I don't know the actual tradeoff.)

I'm thinking about this in terms of orchestration. When many models (one or more per metric stream) need to be trained at the same time, it will add overhead to a single engine node that also has to do other computation (log analysis, ingestion, inference, etc.). Python cannot handle that much work easily without multiprocessing, and it will most likely lead to unmaintainable code.

So we had best scale them out, either through a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes.

The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has the basis for scaling and each part can work or die independently.
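A rough sketch of the second option (continuous learning tasks fanned out to N analyzer workers); the task fields and queue layout are assumptions for illustration, not the actual engine design:

```python
import multiprocessing as mp
from dataclasses import dataclass


@dataclass
class TrainTask:
    metric_name: str  # which metric stream this model belongs to
    value: float      # latest observation to learn from


def analyzer_worker(task_queue) -> None:
    """One analyzer node: pull tasks and incrementally update its local per-metric models."""
    models = {}
    while True:
        task = task_queue.get()
        if task is None:            # poison pill: shut this worker down
            break
        history = models.setdefault(task.metric_name, [])
        history.append(task.value)  # stand-in for detector.fit_score(task.value)


if __name__ == "__main__":
    queue = mp.Queue()
    workers = [mp.Process(target=analyzer_worker, args=(queue,)) for _ in range(4)]
    for w in workers:
        w.start()
    queue.put(TrainTask("cpu_usage", 42.0))
    for _ in workers:
        queue.put(None)
    for w in workers:
        w.join()
```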

Fengrui-Liu commented 2 years ago

> Are the metric algorithms trained incrementally online, or retrained offline periodically?

So far, all the algorithms we have implemented are trained incrementally.

> When many models (one or more per metric stream) need to be trained at the same time, it will add overhead to a single engine node that also has to do other computation (log analysis, ingestion, inference, etc.). Python cannot handle that much work easily without multiprocessing, and it will most likely lead to unmaintainable code.

Exactly, compute consumption also needs to be considered. I think that may be one reason why those commercial products do not deploy complex models. But in my opinion: 1) one kind of AutoML-related technique runs all models (more than one) only during the initial phase, and then selects a few of them to continuously process the follow-up data (see the sketch below); 2) we can let the users decide which metrics to detect, because the agents can export hundreds of metrics, and not every one of them is helpful for untangling the chaos.
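A minimal sketch of that warm-up-then-prune idea, assuming every candidate exposes an incremental fit_score()-style API; the agreement-with-majority scoring rule is a simplified stand-in for the selection step, not an actual AutoML procedure:

```python
class WarmupSelector:
    """Run several candidate detectors during a warm-up phase, then keep only the best one."""

    def __init__(self, candidates: dict, warmup_points: int = 500):
        self.candidates = candidates  # name -> detector instance with a fit_score() method
        self.warmup_points = warmup_points
        self.agreement = {name: 0 for name in candidates}
        self.seen = 0
        self.chosen = None

    def process(self, value: float) -> bool:
        if self.chosen is not None:   # after warm-up, only the chosen detector keeps running
            return bool(self.candidates[self.chosen].fit_score(value))
        votes = {name: bool(det.fit_score(value)) for name, det in self.candidates.items()}
        majority = sum(votes.values()) > len(votes) / 2
        for name, vote in votes.items():  # count how often each detector agrees with the majority
            self.agreement[name] += int(vote == majority)
        self.seen += 1
        if self.seen >= self.warmup_points:
            self.chosen = max(self.agreement, key=self.agreement.get)
        return majority
```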

> So we had best scale them out, either through a periodic learning task scheduler (Airflow) or by assigning continuous learning tasks to 1-N analyzer nodes.

Both options are OK for our detectors by now. For periodic detection, we can use fit() and score() separately, and for continuous learning, we use fit_score().
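Roughly, the two modes could be driven like this (a sketch only; the detector object is a stand-in, and only the fit()/score()/fit_score() method names come from the comment above):

```python
def periodic_mode(detector, train_batch: list, new_batch: list) -> list:
    """Periodic detection: train on a historical batch, then score fresh data separately."""
    for value in train_batch:
        detector.fit(value)
    return [detector.score(value) for value in new_batch]


def continuous_mode(detector, stream) -> None:
    """Continuous learning: every point both updates the model and gets scored in one call."""
    for value in stream:
        anomaly_score = detector.fit_score(value)
        if anomaly_score is not None:
            print(f"anomaly score: {anomaly_score:.3f}")
```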

> The final design will have the engine core, data ingestion, and analyzers (the actual learner workers) each as standalone modules, so it naturally has the basis for scaling and each part can work or die independently.

This can be achieved by instantiating objects. @Superskyyy

Superskyyy commented 2 years ago

Good insights. I'm deciding to move away from Airflow (it was never intended for streaming ETL purposes); we will rely only on a simple MQ to implement the orchestration. In the end, this is just a secondary system to a secondary system (the monitoring platform) and should be as simple to learn as possible.

> we can let the users decide which metrics to detect, because the agents can export hundreds of metrics, and not every one of them is helpful for untangling the chaos.

Yes, this is intended behaviour; the SkyWalking metrics exporter natively supports partial subscription.

> This can be achieved by instantiating objects. @Superskyyy

Standalone modules are a common pattern in today's containerized deployments. In our case, each node communicates only via a Redis task queue; they don't even need to know about each other's existence. In a local machine installation, everything will still be bundled together without any remote nodes (which I'm implementing right now, ideal for testing and the first release).
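For reference, the Redis-backed queue could be as small as this sketch using redis-py list commands; the queue name and payload format are assumptions, not the engine's actual protocol:

```python
import json

import redis  # redis-py client

r = redis.Redis(host="localhost", port=6379)
QUEUE = "aiops:train_tasks"  # hypothetical queue name


def submit_task(metric_name: str, value: float) -> None:
    """Producer side (e.g. data ingestion): push a task onto the shared queue."""
    r.lpush(QUEUE, json.dumps({"metric": metric_name, "value": value}))


def run_analyzer() -> None:
    """Consumer side (analyzer node): block until a task arrives, then process it."""
    while True:
        _, raw = r.brpop(QUEUE)  # blocking pop from the opposite end of the list
        task = json.loads(raw)
        # hand the value off to the per-metric detector here, e.g. detector.fit_score(...)
        print(f"processed {task['metric']}")
```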