Yes it should work, the code is there and it looks at the settings: https://github.com/boostorg/beast/blob/b7344b0d501f23f763a76488826dde8c31f34b5d/include/boost/beast/websocket/detail/impl_base.hpp#L155

`zo.reset()` resets the sliding window. It should only be called if `this->pmd_config_.server_no_context_takeover` is true. Your code sets it to false. So there must be another piece of information that we are missing. Perhaps you can set breakpoints or logging in this function to help diagnose it? Thanks
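For reference, the check around that `reset` looks roughly like this (paraphrased from the linked `impl_base.hpp`; treat the names as illustrative rather than exact):

```cpp
// Reset the deflate stream (dropping the LZ77 sliding window) only
// when no_context_takeover was negotiated for our own role.
void do_context_takeover_write(boost::beast::role_type role)
{
    if( (role == boost::beast::role_type::client &&
            this->pmd_config_.client_no_context_takeover) ||
        (role == boost::beast::role_type::server &&
            this->pmd_config_.server_no_context_takeover))
        this->pmd_->zo.reset();
}
```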
I've tried to investigate this a little; so far I've noticed the following:

There are two `reset` methods on that `zo`, one with params and one without. There are only two calls to that method after we call `async_accept`:

```
async_accept
do_pmd_config
open_pmd
doReset(params)
doReset(params)
on_accept
```
When we call `async_write`, these methods are called (from this file) and `no_context` is set to false, as it should be; also, there are no `reset` calls:

```
async_write
pmd_enabled
pmd_enabled
deflate
do_context_takeover_write: Server. no_context: 0
```
I will continue investigating tomorrow.
I've continued investigating and managed to find a working fix for this. The fix is to use `zlib::Flush::sync` instead of `zlib::Flush::full`, here. I opened a pull request: https://github.com/boostorg/beast/pull/2318
Details:
After some time investigating, I came to the conclusion that the error must be in the way zlib is used inside the beast library. But to be sure that what the specs describe actually works, I first wanted a working example using the official zlib library. So I came up with this code, which uses zlib directly and does exactly what the specs say:
```cpp
#include <string>
#include <zlib.h>

int main()
{
    std::string to_be_compressed{"Hello"};
    unsigned char out[100] = {0};

    z_stream strm;
    int level = 8;
    int flush;
    strm.zalloc = Z_NULL;
    strm.zfree = Z_NULL;
    strm.opaque = Z_NULL;
    strm.avail_in = 0;
    // Negative windowBits (-15) selects a raw deflate stream with no
    // zlib header, which is what permessage-deflate uses.
    int ret = deflateInit2(&strm, level, Z_DEFLATED, -15, 8, Z_DEFAULT_STRATEGY);

    // First deflate
    strm.avail_in = to_be_compressed.size();                               // updated by deflate
    strm.next_in = reinterpret_cast<unsigned char*>(&to_be_compressed[0]); // updated by deflate
    strm.avail_out = sizeof(out);
    strm.next_out = out;
    // flush = Z_FULL_FLUSH; // -> does exactly what beast currently does
    flush = Z_SYNC_FLUSH;
    ret = deflate(&strm, flush);

    // Second deflate of the same input: with Z_SYNC_FLUSH the sliding
    // window survives, so this run can reference the previous one.
    strm.avail_in = to_be_compressed.size();
    strm.next_in = reinterpret_cast<unsigned char*>(&to_be_compressed[0]);
    strm.avail_out = sizeof(out);
    strm.next_out = out;
    flush = Z_SYNC_FLUSH;
    ret = deflate(&strm, flush);
    (void)ret;

    (void)deflateEnd(&strm);
}
```
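To inspect the output, a small helper (hypothetical, not part of the original snippet) can hex-dump the bytes each `deflate` call produced; invoke it as `dump(out, sizeof(out) - strm.avail_out);` right after each call:

```cpp
#include <cstddef>
#include <cstdio>

// Hex-dump the `used` bytes that deflate wrote into `buf`.
void dump(const unsigned char* buf, std::size_t used)
{
    for(std::size_t i = 0; i < used; ++i)
        std::printf("0x%02x ", buf[i]);
    std::printf("\n");
}
```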
The code above compresses `Hello` twice. I did two tests:

- With `flush` set to `Z_FULL_FLUSH` for both compressions, the result was just like beast does it at the moment: I received `0xf2 0x48 0xcd 0xc9 0xc9 0x07 0x00` after both compressions.
- With `flush` set to `Z_SYNC_FLUSH`, for the first compression we receive `0xf2 0x48 0xcd 0xc9 0xc9 0x07 0x00`, and for the second we receive `0xf2 0x00 0x11 0x00 0x00`, just like the specs say.

After making sure that this can be accomplished via the zlib library, I started checking the code from beast.
I've checked the code of the `deflate` method from beast, and as you can see it's similar to the code above (the example using zlib).

First I noticed that `Flush::sync` (`Z_SYNC_FLUSH`) isn't used anywhere in this code; I only saw `zlib::Flush::none`, `zlib::Flush::block` and `zlib::Flush::full`. So I had no idea which one should be changed, and I tried to debug and see which lines are executed from that method. I noticed that all 3 `write` calls are made (with the flushes from above) when the `deflate` method is called with the `Hello` string.

Then I wanted to see exactly what `write` does in each case, and the most interesting thing I found was that when the flush was set to `zlib::Flush::full`, here, it called `clear_hash()`. So, instead of using `zlib::Flush::full` in the `deflate` method, I used `zlib::Flush::sync`.
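As a sketch, the change amounts to swapping the flush mode at that write call in beast's `deflate` (paraphrased; the surrounding code in `impl_base.hpp` is more involved):

```cpp
// Before: Flush::full resets the deflate state (via clear_hash()),
// discarding the sliding window after every message.
this->pmd_->zo.write(zs, zlib::Flush::full, ec);

// After: Flush::sync emits the pending block and byte-aligns the
// output, but keeps the sliding window for the next message.
this->pmd_->zo.write(zs, zlib::Flush::sync, ec);
```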
I did a test, and it worked as we expected (as RFC 7692 says):

- The first `Hello` was compressed like this: `0xf2 0x48 0xcd 0xc9 0xc9 0x07 0x00`
- The second `Hello` was compressed like this: `0xf2 0x00 0x11 0x00 0x00`

I also tested the case where the client and the server do send `client_no_context_takeover` and `server_no_context_takeover`:

- The first `Hello` was compressed like this: `0xf2 0x48 0xcd 0xc9 0xc9 0x07 0x00`
- The second `Hello` was compressed like this: `0xf2 0x48 0xcd 0xc9 0xc9 0x07 0x00`

Which is correct, because when one of those parameters is sent in the handshake, the other end must use an empty sliding window (as the specs say). And this works because `do_context_takeover_write` is called, which resets the compression stream (including the sliding window).
So the fix is to use `zlib::Flush::sync` instead of `zlib::Flush::full`, here.
The explanation for this can be found in the zlib documentation:

Z_FULL_FLUSH:

> If flush is set to Z_FULL_FLUSH, all output is flushed as with Z_SYNC_FLUSH, and the compression state is reset so that decompression can restart from this point if previous compressed data has been damaged or if random access is desired. Using Z_FULL_FLUSH too often can seriously degrade compression.

Z_SYNC_FLUSH:

> If the parameter flush is set to Z_SYNC_FLUSH, all pending output is flushed to the output buffer and the output is aligned on a byte boundary, so that the decompressor can get all input data available so far. (In particular avail_in is zero after the call if enough output space has been provided before the call.) Flushing may degrade compression for some compression algorithms and so it should be used only when necessary. This completes the current deflate block and follows it with an empty stored block that is three bits plus filler bits to the next byte, followed by four bytes (00 00 ff ff).
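Concretely, taking the byte values reported above: the flushed deflate output ends with that empty stored block marker, and RFC 7692 requires stripping those four trailing bytes from the payload before it goes on the wire:

```
deflate("Hello", Z_SYNC_FLUSH):   f2 48 cd c9 c9 07 00 00 00 ff ff
payload per RFC 7692 (tail cut):  f2 48 cd c9 c9 07 00
```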
They recommend avoiding flushing (`Z_FULL_FLUSH`), and they also state that flushing this way resets the compression state.

Warnings:

`zlib::Flush::full`, as you can see here, does some resets when you pass it. So the question that comes to mind is: should we call that full flush from time to time?

Many thanks, I've started a full CI run against your PR. We'll know in an hour or so whether it breaks anything :)
Amazing job tracking this down, and thank you for the detailed write-up!!! I believe your fix is correct.
We did some stress tests with this fix, just to see the impact it has.

We started two instances of our server, both receiving the same input and generating the same output: one with the beast version that doesn't have the fix and one with the fix. Below are the stats we obtained (the instances were stressed and ran for 28 hours; we did this multiple times in the past two weeks):

Conclusion 1: Without the fix, we saw a 40% reduction of our bandwidth; with the fix it got up to 60%. So this fix reduced our bandwidth by an additional 20 percentage points.

Conclusion 2: In our case, the best compression level was 5 (instead of 8, which is the default). Using 8 would decrease the payload by a further 2%, but with an additional CPU cost of 51%. Note: this was our use case; I cannot guarantee that the same will happen with every app.

Note: Our messages are binary data (serialized protobuf messages).
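For reference, the compression level is configured through beast's `websocket::permessage_deflate` option (its `compLevel` member defaults to 8); a minimal sketch:

```cpp
boost::beast::websocket::permessage_deflate opt;
opt.server_enable = true; // negotiate permessage-deflate
opt.compLevel = 5;        // zlib level 5 instead of the default 8
ws.set_option(opt);       // ws: a websocket::stream<...>
```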
Details:
Bandwidth reduction

Total payloads:

- Without the fix: compression ratio 1.6, a 40% improvement
- With the fix: compression ratio 2.6, a 60% improvement

We've generated around 800 million to 1 billion messages in these 28 hours.
Best compression level in our use case

Note: We've used compression level 5.

We did some tests and we noticed that a higher compression level has almost no benefit, only drawbacks (in our case, the CPU usage increased a lot):

Total payloads:

- Level 5: compression ratio 2.6, a 62% improvement
- Level 8: compression ratio 2.8, a 65% improvement (its output is only about 1.1x / 9% smaller than level 5's)

CPU usage:

- Level 5: 32-40% increase
- Level 8: 83% increase

We've generated around 600 million to 1 billion messages in these 28 hours.
So for a 3% improvement to the compressed size, we pay an additional 51% of CPU.
Whether or not the cost of additional CPU load for better compression is worth it depends on your network speed. At 1 Gb/s (i.e. a typical LAN), typically not; at lower speeds (e.g. 100 Mb/s, on the WAN), it can pay off.
The externally defined protocol I implemented does not use WebSocket deflate, but negotiates compression of the message "bodies" (excluding a custom protocol-specific header) on top. This allows using other compression algorithms: I tested zlib, SLZ (a faster zlib-compatible compressor, used by HAProxy), and LZ4. At LAN speeds, only LZ4 is worth it in my case, and even then, when the network gets faster, it no longer helps (e.g. Azure gets up to 30 Gb/s). At lower speeds, zlib wins.
FYI, and FWIW.
See for example https://github.com/lz4/lz4/issues/1006#issuecomment-869757267 for network-bandwidth dependent benefits of using various LZ4 levels.
Compression benefits are very situational. The side doing the compression pays a lot of CPU, while the side decompressing spends very little CPU but saves a bunch of bandwidth. Whether this is a net win depends on the use case.
This is fascinating
We activated `permessage_deflate` and we've noticed that the beast library doesn't behave as the RFC 7692 specs say here, which means that it doesn't reuse the sliding window of the LZ77 algorithm.

Summary of the specs:

If a client wants to send two compressed messages with the same contents (`Hello`), and the client did not receive `client_no_context_takeover` from the server, the second message can reuse the sliding window from the first. The same thing applies to the server, when it doesn't receive `server_no_context_takeover`. If the server sends that message twice, the first one will have the payload `f2 48 cd c9 c9 07 00` and the second will have the payload `f2 00 11 00 00`. So 2 bytes are saved.
How beast does it:

I started from the async client and server examples, with a few minor additions, the main one being activating `permessage_deflate` in both client and server. I've created this mini repository that contains the code to reproduce this problem. The most important addition to the examples is the one that activates `permessage_deflate` (see the sketch below).

After compiling the client and server, we noticed that this doesn't work as the specs say: if we send that message multiple times, we send the same payload (`0xf2 0x48 0xcd 0xc9 0xc9 0x07 0x00`), which means that the beast library doesn't use the sliding window from the previous messages; it compresses each message starting with a new sliding window. This is allowed by the specs, as stated here: the word MAY allows both using a new sliding window and reusing the one from the other messages.
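A minimal sketch of that addition, using beast's `websocket::permessage_deflate` option (the repository's exact code may differ):

```cpp
#include <boost/beast/websocket.hpp>

namespace websocket = boost::beast::websocket;

// Enable permessage-deflate on the stream before the handshake.
template<class Stream>
void activate_permessage_deflate(websocket::stream<Stream>& ws)
{
    websocket::permessage_deflate opt;
    opt.client_enable = true; // offer the extension (client role)
    opt.server_enable = true; // accept the extension (server role)
    ws.set_option(opt);
}
```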
The client handshake request doesn't contain `server_no_context_takeover` or `client_no_context_takeover`, and neither does the server handshake response.

The first `Hello` sent by the server has the payload `f2 48 cd c9 c9 07 00`, and the second has the same payload, when it should have been `f2 00 11 00 00`.

I also attached the complete capture, which can be viewed with Wireshark: capture_permessage_deflate_1.zip
We are using the following:

- Beast 266 (`BOOST_BEAST_VERSION`)
- Boost 1.71.0
- gcc 9.2.0
Why is this important for us?

After some investigation in our server, we noticed that most of the payloads of our messages (70-80%; we're sending up to 100k messages per second in total, to multiple ws connections) are small (from 23 bytes up to 60 bytes), and if we activate compression, the compressed payload gets bigger than the uncompressed one (if we send `Hello` uncompressed we send 5 bytes, and if we compress it we send 7 bytes). We also did a test with `gzip`: we compressed a file locally that had around 33 bytes, and the resulting compressed size was almost double (64 bytes). So we thought the same thing might happen with `permessage_deflate` activated (because most of our messages are around that size). And there is also the CPU used to do that compression, so there are only disadvantages in activating that extension for small messages.

That's why we wanted to see if the WebSocket protocol supports streaming compression, and after reading the specs, we came to the conclusion that it does, if it reuses the sliding window from the previous messages, as stated here in the specs.
Questions:

I've noticed that when we compress a file containing the string `Lorem ipsum dolor sit amet, conse` (33 bytes) with `gzip`, we end up with a compressed file of 65 bytes. But when we send that text via websocket, which compresses it, the size of the compressed payload is 35 bytes. Both use the DEFLATE (RFC 1951) algorithm for compressing (gzip and permessage_deflate, as stated in the specs), so why these differences in size after compression?

Thank you for taking the time to read this.