[Engine] Evaluation of ZSTD (Z-standard) compression algorithm for log data

Superskyyy commented 2 years ago

The AIOps engine will receive a large amount of log data from SkyWalking, and we decided to use a Redis stream as the buffer before stream processing. One noticeable issue is that standard Zlib cannot compress logs well when they arrive one by one: each message is compressed in isolation, so the compressor has too little context to find redundancy (see how-compression-algorithm-works). The result is extra memory/disk usage.
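As a rough illustration of the point above, the sketch below compares compressing log messages one by one with compressing them as a single batch, using only Python's standard zlib module. The log lines are made up for the example and are not real SkyWalking logs, so the exact sizes will differ in practice.

```python
import zlib

# Hypothetical log lines; real SkyWalking logs will look different.
logs = [
    f"2022-07-01 12:00:{i % 60:02d} INFO [order-service] "
    f"request id={i} handled in {i % 97} ms".encode()
    for i in range(1000)
]

raw_size = sum(len(line) for line in logs)

# Each message compressed in isolation, as happens when logs arrive one by one.
per_message_size = sum(len(zlib.compress(line)) for line in logs)

# All messages compressed together, so redundancy across lines can be exploited.
batch_size = len(zlib.compress(b"\n".join(logs)))

print(f"raw: {raw_size} B, per-message: {per_message_size} B, batch: {batch_size} B")
```

The per-message total stays far above the batch size because every short message pays the stream overhead and shares nothing with its neighbours.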

Note that we delete logs from the stream immediately after processing, but it is still worth compressing them to save network bandwidth and to avoid overloading Redis.

This is where ZSTD comes in. It can help our flow in two ways:
(I) Simply replace Zlib with ZSTD to get about 2x average compression speed.
(II) Use a dictionary compressor, that is, learn from a small sample batch of logs and then use that knowledge to further improve compression, which could save extra memory/disk.
(TODO: evaluate how to run the learning phase. Do we train one dictionary per service? Do we train one unified dictionary, or retrain periodically?)
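The dictionary idea can be sketched with the standard library alone. The example below is a stand-in, not ZSTD: it uses zlib's preset dictionary (the `zdict` argument) instead of a trained ZSTD dictionary, and the "training" is simply concatenating the first few sample messages. The log format and helper names are assumptions made for the illustration; with pyzstd, a trained dictionary would play the role that `zdict` plays here.

```python
import zlib


def make_log(i: int) -> bytes:
    # Hypothetical log format, for illustration only.
    return (
        f"2022-07-01 12:00:{i % 60:02d} INFO [order-service] "
        f"request id={i} handled in {i % 97} ms"
    ).encode()


# zlib has no dictionary trainer, so the "dictionary" is just sample logs joined together.
shared_dict = b"\n".join(make_log(i) for i in range(20))


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    compressor = zlib.compressobj(zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    # The consumer must hold the same dictionary the producer used.
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


remaining = [make_log(i) for i in range(20, 1020)]
plain_total = sum(len(zlib.compress(m)) for m in remaining)
dict_total = sum(len(compress_with_dict(m, shared_dict)) for m in remaining)

print(f"per-message without dictionary: {plain_total} B, with dictionary: {dict_total} B")
```

Each message is still compressed on its own, but the shared dictionary supplies the context that a single short message lacks. The trade-off is that producers and consumers have to agree on the dictionary version.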

Some public discussions and references that support its feasibility:
https://groups.google.com/g/redis-db/c/slk-c33EZ7U/m/tx81gCMDDQAJ - adoption case
http://facebook.github.io/zstd/ - performance comparison
https://github.com/animalize/pyzstd - target Python library for implementation

=======================================
Initial experimentation results (suggestions are welcome):

The results below show that ZSTD with a dictionary trained on a very small amount of log data from the same service (the first 1k messages; increasing to 5k doesn't help) saves more than a third of the memory/disk for the remaining 500k messages (54.7 MB versus 86.2 MB with Zlib, about 37% less).

Further experiments are needed to see whether this is generally applicable. An additional idea: if we compose a good dataset that represents "what a normal log would look like", it could be used as universal training data, and the compression ratio might be pushed further.

Note: my Docker Redis bandwidth is slow.

ZLIB
size of log in Megabyte 86.237173MB
Time taken to send 500k messages with batch 2000: 12.09048318862915 seconds
92MB used in actual Redis key

ZSTD with dict training
done training dict on first 1000 log samples
func:train_zstd took: 0.06115330 sec
size of log in Megabyte 54.717921MB
Time taken to send 500k messages with batch 2000: 8.131911993026733 seconds
58MB used in actual Redis key

ZSTD with basic compressor [default level]
size of log in Megabyte 88.285950MB
Time taken to send 500k messages with batch 2000: 9.860241889953613 seconds

ZSTD with rich memory compressor [default level] (a bit decreased compression ratio)
size of log in Megabyte 88.386098MB
Time taken to send 500k messages with batch 2000: 9.413931131362915 seconds
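For reference, the relative savings can be worked out from the sizes reported in these runs. This is only arithmetic on the numbers above; the dictionary keys and the helper name are labels chosen for this sketch.

```python
# Sizes in MB, copied from the four runs reported above.
sizes_mb = {
    "zlib": 86.237173,
    "zstd_dict": 54.717921,
    "zstd_basic": 88.285950,
    "zstd_rich_mem": 88.386098,
}


def saving_vs(baseline: str, candidate: str) -> float:
    """Fraction of space saved by `candidate` relative to `baseline`."""
    return 1 - sizes_mb[candidate] / sizes_mb[baseline]


for name in ("zstd_dict", "zstd_basic", "zstd_rich_mem"):
    # A negative value means the candidate produced more data than Zlib.
    print(f"{name}: {saving_vs('zlib', name):+.1%} vs zlib")
```

Only the dictionary-trained run is smaller than Zlib in this experiment; the two default-level ZSTD runs without a dictionary come out slightly larger.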

wu-sheng commented 2 years ago

Note that Redis is not allowed as a dependency in the ASF because of its license. It is OK to choose it for now.

Superskyyy commented 2 years ago

I checked, and Redis core itself is BSD-3 licensed; we do not use any extensions/modules that contain code under the RSAL license. Would that still be a problem? I'm a bit confused about these things and hope to learn more. Also, in skywalking-python we have a docker-compose.yaml that deploys Redis during tests. Does that mean Redis can be used in dev and testing as long as the final release artifact doesn't include it?

In the future, we could switch to shipping with Kvrocks, but unfortunately it doesn't yet fully support the stream consumer group commands that we rely on heavily.

wu-sheng commented 2 years ago

Are you only using Redis core? Many modules are AGPL, or even Commons Clause.

I didn't check the features you are going to use, so, this is a reminder.

Also, you mentioned it works as a buffer, which is usually the role of a queue server. Why did you choose a Redis queue?

Superskyyy commented 2 years ago

Thanks for the clarification! I just rechecked, and it is strictly Redis core only; as the screenshot shows, the streams engine is part of core. I don't plan to use anything beyond core.

There are two main reasons why I chose Redis over a full-size MQ:

1. We also use Redis to store machine learning model snapshots and other metadata, so this avoids introducing another dependency. A separate MQ would be too much for a secondary system (the AIOps engine) of a secondary system (SkyWalking).
2. I find that Redis Streams provides the same functionality and speed that Kafka can offer for our use case, but is easier to work with and maintain than an MQ.

I plan to add support for queue-based storage (Kafka) in the long run. For now, I think Redis streams work the best.

wu-sheng commented 2 years ago

OK. Like I said, for now even an AGPL module is fine, until you want to move this into the ASF.

Superskyyy commented 2 years ago

Understood, thank you!

Superskyyy commented 2 years ago

TODO: implement a self-optimizer that monitors the compression ratio. If the ratio degrades significantly, retrain the dictionary and propagate it to each consumer to restore compression performance.
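One possible shape for such a monitor is sketched below. It is not an existing implementation: the class name, the rolling-window approach, and the 20% tolerance are all assumptions for illustration, and the retraining and propagation steps are left out.

```python
from collections import deque
from typing import Optional


class RatioMonitor:
    """Tracks a rolling mean compression ratio and flags significant degradation."""

    def __init__(self, window: int = 1000, tolerance: float = 0.2) -> None:
        self.ratios = deque(maxlen=window)  # most recent raw/compressed ratios
        self.tolerance = tolerance          # allowed relative drop before retraining
        self.baseline: Optional[float] = None

    def record(self, raw_size: int, compressed_size: int) -> None:
        if compressed_size <= 0:
            raise ValueError("compressed_size must be positive")
        self.ratios.append(raw_size / compressed_size)

    def current_ratio(self) -> float:
        return sum(self.ratios) / len(self.ratios)

    def set_baseline(self) -> None:
        # Call right after (re)training, once the window holds fresh measurements.
        self.baseline = self.current_ratio()

    def needs_retrain(self) -> bool:
        # Wait for a baseline and a full window before judging.
        if self.baseline is None or len(self.ratios) < self.ratios.maxlen:
            return False
        return self.current_ratio() < self.baseline * (1 - self.tolerance)


monitor = RatioMonitor(window=3, tolerance=0.2)
for _ in range(3):
    monitor.record(raw_size=100, compressed_size=25)  # ratio 4.0
monitor.set_baseline()
for _ in range(3):
    monitor.record(raw_size=100, compressed_size=50)  # ratio drops to 2.0
print(monitor.needs_retrain())
```

When `needs_retrain()` returns true, the producer would train a new dictionary on recent logs, publish it to the consumers, and call `set_baseline()` again once the window has refilled.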

SkyAPM / aiops-engine-for-skywalking

[Engine] Evaluation of ZSTD (Z-standard) compression algorithm for log data #19