SkyAPM / aiops-engine-for-skywalking

This is an incubating repository of the Apache SkyWalking AIOps Engine
https://github.com/apache/skywalking/discussions/8883
Apache License 2.0

[Algorithm] Implement incremental clustering of streaming log based on Drain #5

Open Superskyyy opened 2 years ago

Superskyyy commented 2 years ago

After the initial evaluation phase, we would start our log analysis feature by implementing the Drain method.

The goal of this algorithm is to ingest a stream of raw log records and produce the most likely match for each log among the learned template clusters.
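To make the idea concrete, here is a minimal, hedged sketch of Drain-style incremental clustering. It is illustrative only: the real Drain walks a fixed-depth parse tree keyed by token count and prefix tokens, while this sketch approximates that with a flat list of templates per token count. All names (`SimpleDrain`, `WILDCARD`, the 0.5 threshold) are assumptions for illustration, not the engine's actual API.

```python
# Illustrative Drain-style incremental clustering (NOT the real Drain tree).
from collections import defaultdict

WILDCARD = "<*>"

def similarity(template, tokens):
    """Fraction of positions where the template token matches exactly."""
    same = sum(1 for a, b in zip(template, tokens) if a == b)
    return same / len(tokens)

class SimpleDrain:
    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        # Real Drain groups by token count at the tree root; mimic that here.
        self.clusters = defaultdict(list)  # token count -> list of templates

    def add_log(self, line):
        tokens = line.split()
        group = self.clusters[len(tokens)]
        # Find the most similar existing template with the same token count.
        best, best_sim = None, -1.0
        for template in group:
            s = similarity(template, tokens)
            if s > best_sim:
                best, best_sim = template, s
        if best is not None and best_sim >= self.sim_threshold:
            # Merge into the cluster: differing positions become wildcards.
            merged = [a if a == b else WILDCARD for a, b in zip(best, tokens)]
            group[group.index(best)] = merged
            return " ".join(merged)
        group.append(tokens)  # no close match: start a new template cluster
        return " ".join(tokens)
```

Feeding two logs that differ only in a parameter position merges them into one template with a wildcard, which is the core behavior the engine relies on.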

Superskyyy commented 2 years ago

FYI @Liangshumin

Liangshumin commented 2 years ago

Drain initial

wu-sheng commented 2 years ago

Has the Drain method already been implemented? Or are we going to implement it?

Superskyyy commented 2 years ago

> Has the Drain method already been implemented? Or are we going to implement it?

@wu-sheng There's a neat implementation from IBM (MIT-licensed): https://github.com/IBM/Drain3 but we need some customization for our integration, plus enhancements (an extra cache mechanism and state persistence) to bump up its performance.
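On the state-persistence enhancement: Drain3 already ships its own persistence handlers (e.g. file-based), so the snippet below is only a hedged sketch of the general snapshot/restore idea, with an atomic write so a crash mid-save cannot corrupt the state file. The function names and the use of pickle are assumptions for illustration.

```python
# Illustrative periodic state persistence for a streaming miner (hypothetical
# helpers, not Drain3's actual persistence API).
import os
import pickle
import tempfile

def save_state(miner_state, path):
    # Write atomically: dump to a temp file in the same directory, then
    # rename over the target so readers never see a half-written file.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(miner_state, f)
    os.replace(tmp, path)

def load_state(path, default=None):
    # Restore a previous snapshot, or fall back to a fresh state.
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```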

Relative to the original code base's LOC, the changes could amount to more than a minimal modification.

I'm thinking of keeping a fork here in the org for these features, then sometime in the future cherry-picking and contributing them back upstream. A total rewrite seems unnecessary, given that the core parts are already very well written and continuously receiving updates.

Superskyyy commented 2 years ago

An interesting thing I found earlier: Drain has an updated version, described in its journal paper, called DAG Drain. This more recent paper describes an auto parameter tuning method built on a DAG implementation; do take a deep look, @Liangshumin: https://arxiv.org/pdf/1806.04356.pdf There's no publicly available implementation that I know of, so reproducing it could be a great contribution to the open-source and research communities.

wu-sheng commented 2 years ago

> There's a neat implementation from IBM (MIT-licensed): https://github.com/IBM/Drain3 but we need some customization for our integration, plus enhancements (an extra cache mechanism and state persistence) to bump up its performance. Relative to the original code base's LOC, the changes could amount to more than a minimal modification.

Could you share from what perspective we are going to change it? Could we use a git submodule (or a git commit-id lock) to import this repo and apply minimal changes on our side, rebuilding it in the compile process? I don't like the fork approach, as it is hard to control the boundaries of changes.

wu-sheng commented 2 years ago

About DAG Drain, it would be great if we implement it under the MIT license too.

Superskyyy commented 2 years ago

> Could you share from what perspective we are going to change it? Could we use a git submodule (or a git commit-id lock) to import this repo and apply minimal changes on our side, rebuilding it in the compile process? I don't like the fork approach, as it is hard to control the boundaries of changes.

In the short run (before 0.1.0), I will only ask @Liangshumin to optimize the input layer to use cache lookups so we can speed it up a bit more. That should be doable by wrapping the library, without needing to submodule it. Anyway, anything can be overridden, so it's not a problem.
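The input-layer cache idea can be sketched by wrapping the miner's match call, on the assumption that identical raw lines repeat often in real log streams: an LRU of raw line to template lets repeats skip the tree walk entirely. The names here (`CachedMatcher`, `match_fn`) are hypothetical, not the engine's actual API.

```python
# Hypothetical input-layer LRU cache in front of an expensive match function.
from collections import OrderedDict

class CachedMatcher:
    def __init__(self, match_fn, capacity=100_000):
        self.match_fn = match_fn     # the underlying (expensive) Drain match
        self.cache = OrderedDict()   # raw line -> template, in LRU order
        self.capacity = capacity
        self.hits = self.misses = 0

    def match(self, line):
        if line in self.cache:
            self.cache.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return self.cache[line]
        self.misses += 1
        template = self.match_fn(line)
        self.cache[line] = template
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return template
```

Since the wrapper only intercepts exact repeats, it never changes clustering results; it just avoids redundant tree walks.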

In the longer run, I or someone else will work on rewriting some key methods in Cython or parallelizing the tree states using multiprocessing. Considering the original algorithm's core is only < 500 lines of code, that amounts to a major rewrite, and I'd like to contribute these things back upstream.

@wu-sheng Btw, please advise how many logs per second a typical medium-scale SkyWalking deployment receives. I don't have a good sense of it, but I'd like to understand the requirements so we can avoid under- or over-optimizing. It's already a very fast algorithm, but it can be extremely fast with the optimizations above.

> About DAG Drain, it would be great if we implement it under the MIT license too.

Fair enough, since most research-reproduction code projects are under MIT.

wu-sheng commented 2 years ago

Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability. Performance should only be considered from the architecture perspective, such as deployment bottlenecks and scaling capability.

Superskyyy commented 2 years ago

> Don't worry about performance. All open source starts from an MVP, and then we run benchmarks to see its capability. Performance should only be considered from the architecture perspective, such as deployment bottlenecks and scaling capability.

That's true, thank you for the advice. Anyway, I just tested it: the algorithm alone can handle at least 10k+ raw logs per second on a 5-million-record dataset with 200+ patterns, so it seems stable enough.

wu-sheng commented 2 years ago

BTW, for exporting logs to this engine, we could provide a throughput-limit sampling mechanism, such as 10k/s as the max export rate, which would make the AI engine's payload predictable.
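One common way to realize such a throughput limit is a token bucket that admits at most `rate` logs per second and samples out the rest; the sketch below is a hedged illustration of that idea (the class and parameter names are assumptions, not SkyWalking's exporter API). The clock is injected so the limiter is testable without real time passing.

```python
# Illustrative token-bucket limiter for capping log export at N logs/second.
import time

class TokenBucket:
    def __init__(self, rate, burst=None, clock=time.monotonic):
        self.rate = float(rate)               # tokens refilled per second
        self.capacity = float(burst or rate)  # max burst size
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def allow(self):
        # Refill proportionally to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True   # within budget: export this log
        return False      # over budget: sample it out
```

With `rate=10_000`, the exporter would call `allow()` per log and drop (or downsample) whatever exceeds the budget, keeping the engine's inbound load predictable.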

Superskyyy commented 1 year ago

The DAG version of Drain seems unnecessary and would not provide much enhancement. After extensive experimentation, its auto threshold derivation heuristic does not seem generally applicable (unless I did it wrong), so we'll stick with the default one for now.

I'm formalizing the algorithm implementation into our code base.