SkyAPM / aiops-engine-for-skywalking

This is an incubating repository of the Apache SkyWalking AIOps Engine
https://github.com/apache/skywalking/discussions/8883
Apache License 2.0

[Engine] Log data ingestion from gRPC data source #6

Closed Superskyyy closed 1 year ago

Superskyyy commented 2 years ago

As discussed with @kezhenxu94, we will integrate with SkyWalking OAP by implementing a new gRPC log exporter and a data ingestion mechanism on our side.

Goal: set up a ~gRPC~ server and implement its servicer to subscribe to the data stream for further processing.

As this requires work on the OAP side, we should start by defining the gRPC proto and a mock client first, so that we can postpone the actual integration until after a full evaluation.

~- [ ] @kezhenxu94 will help to provide the proto definition.~

~- [ ] Implement log stream handling and response flow. - grpc.aio.server~

~- [ ] Implement a mock gRPC client to test our pipeline & algorithms (streaming from some log file).~
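The mock-client task above can be prototyped with plain asyncio before wiring in `grpc.aio` and generated stubs. This is an illustrative sketch only; all names here are assumptions, not the eventual servicer API:

```python
# Sketch: stream lines from a log file as if they were gRPC messages,
# plus a servicer-style coroutine consuming the stream. The real
# version would use grpc.aio with stubs generated from the proto.
import asyncio
from typing import AsyncIterator

async def mock_log_stream(path: str) -> AsyncIterator[str]:
    """Yield log lines the way a mock client would stream them."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")
            await asyncio.sleep(0)  # yield control, as a network stream would

async def handle_stream(stream: AsyncIterator[str]) -> int:
    """Servicer-side handler: consume the stream, return the count processed."""
    processed = 0
    async for record in stream:
        # ... hand the record to the analysis pipeline here ...
        processed += 1
    return processed
```

Run with e.g. `asyncio.run(handle_stream(mock_log_stream("app.log")))`; swapping the async generator for a real gRPC stream should leave the handler unchanged.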

wu-sheng commented 2 years ago

Please consider the larger implications of log exporting. Once we have it for logging, we can't reject trace exporting anymore, which we have rejected for years. Butterfly effect, I'm afraid.

Superskyyy commented 2 years ago

> Please consider the larger implications of log exporting. Once we have it for logging, we can't reject trace exporting anymore, which we have rejected for years. Butterfly effect, I'm afraid.

@wu-sheng I see. May I know the reason why we want to reject trace export? Is it because of the overhead, or something else? I'm not sure if there's a better way to tackle the log data problem.

wu-sheng commented 2 years ago

It is about the throughput impact. As exporting blocks the stream, if the upstream (the export target) is slower, it would make the OAP slower. In some cases, that makes people feel SkyWalking itself has issues.

I am not saying SkyWalking should not do this. If we want to do log export, we have to support traces too. Along with this, we should provide more self-observability metrics, as well as a doc to help keep us out of trouble.

Superskyyy commented 2 years ago

> It is about the throughput impact. As exporting blocks the stream, if the upstream (the export target) is slower, it would make the OAP slower. In some cases, that makes people feel SkyWalking itself has issues.
>
> I am not saying SkyWalking should not do this. If we want to do log export, we have to support traces too. Along with this, we should provide more self-observability metrics, as well as a doc to help keep us out of trouble.

It's indeed very true that we need to care about streaming backpressure. I'm putting a Redis server on the engine side as caching storage, so we can almost always ingest at full speed, and evict old buckets of logs before memory fills up so the engine doesn't end up dead. After that, we pop elements from the Redis Lists, oldest key first, to report the (minimal) predicted data back to OAP.
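The buffering policy described above can be illustrated with a small in-memory stand-in for the Redis Lists (bucket keys, limits, and class names here are all hypothetical, not the engine's actual implementation):

```python
# Sketch: logs are appended to time-bucketed lists; the oldest bucket
# is evicted before memory fills up; a reporter pops oldest-first.
from collections import OrderedDict, deque
from typing import Optional

class LogBuffer:
    def __init__(self, max_buckets: int = 60):
        self.max_buckets = max_buckets
        self.buckets: "OrderedDict[int, deque]" = OrderedDict()

    def ingest(self, minute: int, log: str) -> None:
        """Append a log to its time bucket, evicting the oldest bucket when full."""
        if minute not in self.buckets:
            if len(self.buckets) >= self.max_buckets:
                self.buckets.popitem(last=False)  # drop the oldest bucket
            self.buckets[minute] = deque()
        self.buckets[minute].append(log)

    def pop_oldest(self) -> Optional[str]:
        """Pop one log from the oldest non-empty bucket (FIFO overall)."""
        for minute in list(self.buckets):
            if self.buckets[minute]:
                return self.buckets[minute].popleft()
            del self.buckets[minute]  # clean up drained buckets
        return None
```

With Redis, `ingest` would map to `RPUSH` on a per-bucket key and eviction to `DEL` on the oldest key; the ingest path never blocks on the downstream consumer.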

As the exporters are optional functionality, we would explicitly emphasize the potential risks of enabling them in the documentation on both sides.

Superskyyy commented 2 years ago

Down the road, I plan to also add a compact variant of log clustering that should suffice, using only GraphQL, for other use cases.

The idea is quite simple: only pull the range of N logs (or a time range) before and after a service X triggers some alarm X.a. Then we cluster and return a quick local-optimum result. We can also trigger an on-demand analysis the same way.

This compact version will be quite useful for manual debugging purposes, but on its own it is not enough for automatic anomaly detection.
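The window-then-cluster idea above can be sketched in a few lines. This is only an illustrative local-optimum grouping (crude digit masking), not the engine's actual clustering algorithm:

```python
# Sketch: take a window of logs around the alarm-triggering entry and
# group them by a naive template (numbers masked out).
import re
from collections import Counter

def window(logs, alarm_index: int, n: int = 50):
    """Return the N logs before and after the alarm-triggering entry."""
    return logs[max(0, alarm_index - n): alarm_index + n + 1]

def template(line: str) -> str:
    """Mask variable parts (here just digits) so similar logs share a key."""
    return re.sub(r"\d+", "<*>", line)

def cluster(logs) -> Counter:
    """Group a log window into templates with occurrence counts."""
    return Counter(template(line) for line in logs)
```

A real implementation would use a proper template-mining tree, but even this crude grouping shows how a small alarm-centered window can yield a quick, locally useful clustering.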

Only learning from enough normal data will yield the most satisfactory results; that is, the engine continuously learns during periods when there is no alarm or anomaly.

To boost the normal data count, we will implement a reusable template persistence mechanism so the tree can be rebuilt even after a total service reboot. As this is much more sophisticated, I choose to start with the easier non-compact version first.

wu-sheng commented 2 years ago

For on-demand logs, are you analyzing on demand rather than using persistent logs? I think for a single user these two are alternatives, from my understanding. So I would prefer that these be two separate ways of working.

Superskyyy commented 2 years ago


Yes, these are alternative use cases; we can try to support all of them in the future. It wouldn't require much change on the engine side to support any kind of log data source or triggering event, because the algorithm is always the same.

Superskyyy commented 2 years ago

I've created the proto on the engine side for testing ^

wu-sheng commented 2 years ago

When you feel it is ready, submit it to the official main repo for review. Don't go too far ahead and then face review challenges afterwards.

Superskyyy commented 1 year ago

Update: gRPC data ingestion won't work, at least not for logs, because of the terrible Python gRPC library performance (<2k streaming logs per second, not joking: https://grafana-dot-grpc-testing.appspot.com). It's also hard to scale/load-balance such a streaming connection.

On the other hand, on an 8C16G VM with a single-node Redis, the current log clustering throughput is >60k logs/second. Without the co-located Redis/IDEs it is likely to be much higher, and it can also be auto-scaled.

Since Redis Streams are really powerful, I propose to export logs directly from OAP to Redis, with the AIOps engine listening on the other end.

An alternative Kafka exporter will be implemented in the future.
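The proposed hand-off could look roughly like this on the engine side, using redis-py with a consumer group so ingestion can be scaled horizontally and resumed after restarts. The stream key, group name, and field layout below are assumptions, not the actual OAP exporter contract:

```python
# Sketch: OAP XADDs each log record to a capped stream; the engine
# reads with XREADGROUP and XACKs entries once processed.
STREAM_KEY = "skywalking:logs"   # hypothetical stream name
GROUP = "aiops-engine"           # hypothetical consumer group

def decode_entry(fields: dict) -> dict:
    """Decode one stream entry's raw field map into a plain log record."""
    return {k.decode(): v.decode() for k, v in fields.items()}

def consume(consumer_name: str) -> None:
    import redis  # requires redis-py and a running Redis server
    r = redis.Redis()
    try:
        r.xgroup_create(STREAM_KEY, GROUP, id="0", mkstream=True)
    except redis.ResponseError:
        pass  # group already exists
    while True:
        # Block up to 5s waiting for new entries, 100 at a time.
        batch = r.xreadgroup(GROUP, consumer_name, {STREAM_KEY: ">"},
                             count=100, block=5000)
        for _stream, entries in batch:
            for entry_id, fields in entries:
                record = decode_entry(fields)
                # ... feed the record into the clustering pipeline ...
                r.xack(STREAM_KEY, GROUP, entry_id)
```

Running several `consume()` workers under the same group spreads entries across them, which is the scaling property that a single gRPC streaming connection lacks.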

wu-sheng commented 1 year ago

> Since Redis Streams are really powerful, I propose to export logs directly from OAP to Redis.

I would be concerned about accepting this exporter as official. It seems unprofessional.

Superskyyy commented 1 year ago

> > Since Redis Streams are really powerful, I propose to export logs directly from OAP to Redis.
>
> I would be concerned about accepting this exporter as official. It seems unprofessional.

@wu-sheng (I admit it sounds crazy to export a massive amount of logs to an in-memory queue.) It can easily be swapped for any other mainstream MQ as the community sees fit, configurable through its plugin design. So if a Redis queue isn't a usual option, let me go straight to maintaining a Kafka exporter in the official repo. I can start implementing now.

The current Redis plugin code will be kept as a way of extending the AIOps engine with custom log data sources, and for research purposes.

wu-sheng commented 1 year ago

This is not a blocker for you to continue, just a reminder. We may need to change this when it goes upstream.

Superskyyy commented 1 year ago

Closing in favor of the Kafka exporter; the Flink Kafka connector has been implemented.