CAFxX opened this issue 5 years ago
Hi @CAFxX, thanks for reaching out,
We’ll take the proposal into consideration and update this issue!
Hi @CAFxX,
Our backend is not yet ready to support zstd. In addition, based on https://github.com/facebook/zstd#benchmarks the difference between zlib and zstd is about 15%.
Can you contact our support team to help you reduce your egress cost?
> In addition, based on https://github.com/facebook/zstd#benchmarks the difference between zlib and zstd is about 15%.
At the same compression speed yes, but that's why I explicitly asked to be able to configure the compressor level as we are willing to trade CPU time for higher compression ratios and lower egress cost.
The page you are quoting from shows significantly more than 15% improvements under these conditions.
> Can you contact our support team to help you reduce your egress cost?
Already did, so far no other option has been offered.
Just as a quick benchmark based on real data, I downloaded a few MBs of our logs from the datadog UI (from non-overlapping time ranges) and tried to compress them at different levels using zstd and gzip. Results:
8849920 extract.tar
1471278 extract.tar.1.gz
1126425 extract.tar.def.gz
1089061 extract.tar.9.gz
895874 extract.tar.1.zst
810474 extract.tar.def.zst
679392 extract.tar.11.zst
584194 extract.tar.19.zst
Going from gzip 9 to zstd 11, the compressed-size savings are ~38%. zstd 19 is too slow, but 11 is not significantly slower than gzip 9, so we could probably raise the level even further.
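The level-vs-size tradeoff above can be sketched with Python's stdlib zlib on synthetic log-like data (zstd is not in the Python stdlib; the sample payload below is an assumption, and real logs will show different absolute numbers, but the trend of higher level → smaller output at more CPU cost is the same):

```python
import zlib

# Hypothetical log-like payload; real log files are larger and more varied,
# which is why the measured savings in the thread differ from this toy case.
sample = (b'{"ts":"2020-01-01T00:00:00Z","level":"info","msg":"request handled",'
          b'"path":"/api/v1/items","status":200,"duration_ms":12}\n') * 2000

for level in (1, 6, 9):
    out = zlib.compress(sample, level)
    ratio = len(sample) / len(out)
    print(f"zlib level {level}: {len(out)} bytes (ratio {ratio:.1f}x)")
```

Higher levels never produce larger output on repetitive data like this, which is the property the thread relies on when asking for a configurable level.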
@CAFxX Did you use dictionary-trained compression?
zstd --train FullPathToTrainingSet/* -o dictionaryName
zstd -D dictionaryName FILE
zstd -D dictionaryName --decompress FILE.zst
I did not use custom dictionaries; it was a very simple 2-minute test just to measure whether it would be useful.
(also I am not sure how using custom dictionaries would work for this use-case, as if clients were to use custom dictionaries, the dictionaries would need to be shared and kept in sync with the datadog ingest servers - otherwise decompression on the ingest side would not work)
I was just wondering if it was this much better even without a dictionary. I'm not sure how this would work in your case; I would probably take a day's worth of data, train a dictionary once, and share it with all the nodes involved, maybe re-training once a month. Not sure how much additional savings you could achieve.
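The sync constraint raised above can be illustrated with zlib preset dictionaries, which behave analogously to zstd's trained dictionaries in this respect: the receiver must hold the exact same dictionary bytes the sender used, or decompression fails. This is a sketch with an assumed toy dictionary, not the zstd training flow itself:

```python
import zlib

# Toy "dictionary" of strings expected to recur in the payload; a real zstd
# dictionary would come from `zstd --train` over representative samples.
dictionary = b'"level":"info","msg":"request handled","status":200'
payload = b'{"level":"info","msg":"request handled","status":200}\n' * 100

# Sender (e.g. the agent) compresses with the shared dictionary.
comp = zlib.compressobj(9, zlib.DEFLATED, 15, 9,
                        zlib.Z_DEFAULT_STRATEGY, zdict=dictionary)
blob = comp.compress(payload) + comp.flush()

# Receiver (e.g. the ingest side) can only decompress with the same dictionary.
decomp = zlib.decompressobj(15, zdict=dictionary)
assert decomp.decompress(blob) == payload
```

This is why the dictionaries would have to be distributed to, and versioned consistently across, both the clients and the datadog ingest servers.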
A few years ago (https://github.com/DataDog/datadog-agent/pull/450) support for zstd was disabled by default, citing incompatibilities with the datadog ingest side.
We are currently facing pretty high egress costs as datadog has no ingest PoP in our GCP region (asia-northeast-1), and we would be pretty happy to trade some CPU cycles for lower egress costs.
Would it be possible to re-enable zstd support and, ideally, to make the compression level configurable?
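If such a knob existed, the agent configuration might look something like the fragment below. The key names here are purely illustrative assumptions for the sake of the request, not actual datadog-agent options:

```yaml
# datadog.yaml — hypothetical keys, for illustration only
logs_config:
  compression_kind: zstd    # hypothetical: gzip (current behavior) or zstd
  compression_level: 11     # hypothetical: trade CPU time for lower egress
```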