[Engine] Adopt Ray as orchestration backend

Superskyyy commented 2 years ago

I've been experimenting with the pipeline using native multiprocessing and Redis Streams/RQ recently, and it quickly becomes messy when we spawn many processes.

So I'm evaluating Ray as the backend engine to orchestrate the streaming processing jobs while also supporting the batch learning that anomaly detection might need. So far, it looks promising.

The main benefits of Ray for us are:

Worker management (Redis Streams will serve only as the IN/OUT data queue, no longer as a task queue).
It's much lighter than Spark/Flink.
Autoscaling.
It has a UI to monitor some critical system metrics.

@Liangshumin @Fengrui-Liu FYI, there will be some changes to the existing designs that I communicated over chat. Please pay attention to the algorithm training part, as Ray offers many out-of-the-box ML features.

[x] For logs (clustering)
[ ] For metrics

Superskyyy commented 2 years ago

Things I've tested:

Pure multiprocessing with native queue - very low throughput.
Redis Streams + multiprocessing - fast but complex; it cannot be scaled up or down easily.
Redis task queue - high Redis overhead, and awkward for stream processing.

Current plan for log data:

Source (OAP) -> N * gRPC (Ingestors) -> In Stream (Redis) -> Ray Actor (Stream Consumers) -> Maskers (Preprocessors) -> Ready Stream (Redis) -> ML (Learners) -> Out Stream (Redis) -> Ray Actor (Exporters) -> Destination (OAP)
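
To make the data flow easier to follow, here is a minimal, stdlib-only sketch of the stages. It is not the prototype code. Everything in it is an assumption made for illustration: the `Stream` class is an in-memory stand-in for a Redis stream, the plain classes stand in for what would presumably be Ray actors (for example, classes decorated with `@ray.remote`), the masking rules are made up, and the "learner" only counts templates instead of running a real clustering model.

```python
import re
from collections import deque


class Stream:
    """Hypothetical in-memory stand-in for one Redis stream (data only, no tasks)."""

    def __init__(self):
        self._entries = deque()

    def add(self, entry):
        self._entries.append(entry)

    def read(self, count):
        """Remove and return up to `count` entries, oldest first."""
        batch = []
        while self._entries and len(batch) < count:
            batch.append(self._entries.popleft())
        return batch


# Assumed masking rules: IPv4-like tokens first, then any remaining integers.
_IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
_NUM = re.compile(r"\b\d+\b")


def mask(line):
    """Replace variable parts of a log line with placeholders."""
    return _NUM.sub("<NUM>", _IP.sub("<IP>", line))


class StreamConsumer:
    """Stand-in for a consumer actor: In Stream -> mask -> Ready Stream."""

    def __init__(self, in_stream, ready_stream, batch_size=2):
        self.in_stream = in_stream
        self.ready_stream = ready_stream
        self.batch_size = batch_size

    def step(self):
        batch = self.in_stream.read(self.batch_size)
        for line in batch:
            self.ready_stream.add(mask(line))
        return len(batch)  # 0 means there was nothing to consume


class Learner:
    """Placeholder for the ML stage: counts how often each template was seen."""

    def __init__(self, ready_stream, out_stream):
        self.ready_stream = ready_stream
        self.out_stream = out_stream
        self.counts = {}

    def step(self):
        batch = self.ready_stream.read(100)
        for template in batch:
            self.counts[template] = self.counts.get(template, 0) + 1
            self.out_stream.add((template, self.counts[template]))
        return len(batch)


class Exporter:
    """Stand-in for an exporter actor: drains Out Stream toward the destination."""

    def __init__(self, out_stream):
        self.out_stream = out_stream
        self.exported = []

    def step(self):
        batch = self.out_stream.read(100)
        self.exported.extend(batch)
        return len(batch)


def run_pipeline(lines, num_consumers=2):
    in_stream, ready_stream, out_stream = Stream(), Stream(), Stream()
    for line in lines:  # the ingestors would do this on receiving gRPC data
        in_stream.add(line)
    consumers = [StreamConsumer(in_stream, ready_stream) for _ in range(num_consumers)]
    learner, exporter = Learner(ready_stream, out_stream), Exporter(out_stream)
    # Step every consumer each round until the In Stream is drained.
    while sum([c.step() for c in consumers]) > 0:
        pass
    while learner.step():
        pass
    while exporter.step():
        pass
    return exporter.exported
```

The point of the sketch is that the streams carry only data, while the number of consumers (`num_consumers` here) is a separate knob. In the plan above, that knob is what Ray's worker management and autoscaling would control.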

Superskyyy commented 2 years ago

I'll complete a prototype to showcase the flow this weekend.

Superskyyy commented 2 years ago

POC: https://github.com/SkyAPM/aiops-engine-for-skywalking/pull/23

Superskyyy commented 1 year ago

Closing in favor of moving to Flink. A new PoC has been implemented.

[Engine] Adopt Ray as orchestration backend #14