try to compress payloads in requests across services

eckter commented 1 month ago

Who would benefit from this feature?

Both

What is this feature about?

Services communicate with each other using JSON data in plain text. These payloads are sometimes quite large (> 10 MB), especially for editoast -> core requests: stdcm (50 MB) or infra loading (150 MB).

The HTTP protocol lets us compress data using various compression methods (gzip, ...). Libraries can generally be configured to enable this quite easily.

We should try to use compression and see what happens.

Why is this feature valuable?

This feature may:

Improve performance in some cases
Fix some rabbitmq or redis issues

Anything else people should know?

This isn't something I'm personally familiar with; I've only heard that it's often beneficial.

I don't know how easy it is to implement, whether all the libraries we use support it, or how it integrates with the worker/queue architecture.

I'm not entirely sure it would be a performance gain; computation times should be measured and compared carefully.
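
A rough way to check this is to time gzip on a representative payload and compare sizes. The sketch below uses only the Python standard library. The `measure_gzip` helper and the payload shape are made up for illustration and are not taken from the OSRD code base, so real stdcm or infra payloads may behave differently.

```python
import gzip
import json
import time
from typing import Dict


def measure_gzip(payload: bytes, level: int = 6) -> Dict[str, float]:
    """Compress and decompress `payload` once, returning sizes and timings."""
    start = time.perf_counter()
    compressed = gzip.compress(payload, compresslevel=level)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = gzip.decompress(compressed)
    decompress_s = time.perf_counter() - start

    if restored != payload:
        raise ValueError("round trip did not restore the original payload")
    return {
        "raw_bytes": len(payload),
        "compressed_bytes": len(compressed),
        "ratio": len(compressed) / len(payload),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Hypothetical payload: a list of similar JSON objects standing in for a real request body.
sample = json.dumps(
    [{"train_id": i, "path": ["track_a", "track_b"], "speed": 42.5} for i in range(10_000)]
).encode("utf-8")

stats = measure_gzip(sample)
print(
    f"{stats['raw_bytes']} -> {stats['compressed_bytes']} bytes "
    f"(ratio {stats['ratio']:.3f}), "
    f"compress {stats['compress_s']:.4f}s, decompress {stats['decompress_s']:.4f}s"
)
```

The timings only cover the compression step itself; whether the smaller transfer makes up for it would still have to be measured end to end.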

woshilapin commented 6 days ago

A few remarks:

How about debuggability? Will all of our payloads be readable when needed (web browser, curl & co., RabbitMQ UI, others?)
An alternative to compression could be a binary format (Protobuf, FlatBuffers, Cap'n Proto, Avro, etc.), but that might seriously hurt the debuggability mentioned in the previous item.

eckter commented 6 days ago

If I understand how this works: compressing the responses is part of the http protocol. It can and should be optional.

curl has a --compressed flag that requests a compressed response and automatically decompresses it. Without that flag, curl doesn't send Accept-Encoding: gzip in the request headers, so it shouldn't get a compressed response.
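
To illustrate why a plain curl call stays readable, here is a minimal sketch of the negotiation from the server's point of view: the body is gzipped only when the client's Accept-Encoding header allows gzip. `accepts_gzip` and `encode_response` are hypothetical helpers, and the header parsing is simplified (it ignores `*`, for instance). HTTP libraries normally handle this for us.

```python
import gzip
from typing import Dict, Optional, Tuple


def accepts_gzip(accept_encoding: Optional[str]) -> bool:
    """Return True if the Accept-Encoding value allows gzip (simplified parsing)."""
    if not accept_encoding:
        return False
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        if name.strip().lower() != "gzip":
            continue
        params = params.replace(" ", "").lower()
        if params.startswith("q="):
            # "gzip;q=0" means the client explicitly refuses gzip.
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def encode_response(body: bytes, accept_encoding: Optional[str]) -> Tuple[bytes, Dict[str, str]]:
    """Compress the body only if the client asked for it, and say so in the headers."""
    headers = {"Content-Type": "application/json"}
    if accepts_gzip(accept_encoding):
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    headers["Content-Length"] = str(len(body))
    return body, headers


original = b'{"status": "ok"}'
# No Accept-Encoding header, as with curl without --compressed: the body stays plain.
plain_body, plain_headers = encode_response(original, None)
# A client that advertises gzip gets a compressed body.
gzip_body, gzip_headers = encode_response(original, "gzip, deflate")
```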

What I had in mind with this issue is to just toggle some library flags that should make it work transparently and out of the box. This is not about adding a whole new homemade layer on top of all requests. The initial scope was just HTTP requests.

Compressing the content of rabbitmq and redis may or may not be an "out of the box" thing as well, but that seems less likely. I'm not sure how we should handle it if it's not. We could consider doing it manually, but I'm not sure it would be beneficial.

Khoyo commented 3 days ago

RabbitMQ transports arbitrary byte payloads, so no out-of-the-box solution exists, but adding a byte compression step should be simple (and AMQP 0.9.1 does have a content encoding message property if we want to use it). There may be performance gains (JSON compresses very well), but if we go that route we could (also?) use a binary protocol (self-describing or not, e.g. MessagePack/BSON instead of protobuf).

flomonster commented 3 days ago

The main problem today is the size of messages sent to core. This requires a fairly high RAM limit for RabbitMQ instances (2 GB).

I haven't observed any major problems on the redis side. Maybe the performance can be improved :shrug:

Trying to activate the parameter @khoyo is talking about seems like a good start. However, planning to use a binary protocol like protobuf or bson doesn't seem relevant to me until we've validated the performance problems associated with payload size.

eckter commented 3 days ago

Notes for the 09/16 workshop:

HTTP requests

For HTTP requests, we should check whether libraries can handle compression natively. If they do, great (we should still measure performance). If they don't, we can just drop it. There's likely not much to be gained there (edit: except for the infra loading process).

RabbitMQ

(side-note: large payloads in rabbitmq may be an issue in itself)

(question: does osrdyne read the infra id from the payload? (apparently not))

Adding compression there would fix issues, but native support is limited. There's a content-encoding attribute, but it's apparently not used by RabbitMQ itself; it's up to us to handle it.

The issue is about debuggability. We don't want to have unreadable payloads (e.g. in the rabbitmq interface).

We could add a parameter (e.g. env variable) to add compression when sending something to the queue. When reading, we'd rely on the content encoding.
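
A minimal sketch of that idea, written in Python for brevity and independent of any AMQP client: the sender compresses only when an environment variable is set and records this in the message's content encoding, and the reader decides from that property alone. `COMPRESS_QUEUE_PAYLOADS`, `encode_message` and `decode_message` are hypothetical names, and the message properties are modelled as a plain dict rather than a real client's properties object.

```python
import gzip
import json
import os
from typing import Any, Dict, Tuple

# Hypothetical toggle: unset or "0" keeps payloads as readable JSON.
COMPRESSION_ENV_VAR = "COMPRESS_QUEUE_PAYLOADS"


def encode_message(payload: Any) -> Tuple[bytes, Dict[str, str]]:
    """Serialize a payload, gzipping it only when the toggle is enabled."""
    body = json.dumps(payload).encode("utf-8")
    properties = {"content_type": "application/json"}
    if os.environ.get(COMPRESSION_ENV_VAR, "0") == "1":
        body = gzip.compress(body)
        properties["content_encoding"] = "gzip"
    return body, properties


def decode_message(body: bytes, properties: Dict[str, str]) -> Any:
    """Decode a message based on its content encoding, not on the sender's settings."""
    encoding = properties.get("content_encoding")
    if encoding == "gzip":
        body = gzip.decompress(body)
    elif encoding not in (None, "", "identity"):
        raise ValueError(f"unsupported content encoding: {encoding}")
    return json.loads(body.decode("utf-8"))


request = {"infra": 1, "steps": [1, 2, 3]}

os.environ[COMPRESSION_ENV_VAR] = "1"
compressed_body, compressed_props = encode_message(request)

os.environ[COMPRESSION_ENV_VAR] = "0"
plain_body, plain_props = encode_message(request)
```

Because the reader only looks at the message property, compressed and uncompressed messages can coexist in the same queue, so the toggle can stay off whenever payloads need to be readable in the RabbitMQ interface.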

Redis

It seems to have the same issues and possible solutions as RabbitMQ, but apparently redis works fine as it is with large payloads. We can ignore it for now and focus on RabbitMQ, and then maybe apply the solutions we've found to redis.

RabbitMQ request sizes

So the main issue is probably that we're not supposed to put large payloads in there. Could we find other solutions?

The largest payload in RabbitMQ is the stdcm request input. We could remove the timetable data and have core fetch it when receiving the request, but that would make debugging more tedious (we couldn't just reuse requests with no other context).

Khoyo commented 3 days ago

> However, planning to use a binary protocol like protobuf or bson

MessagePack should be superior to BSON for our use case, and way easier to use than a non-self-describing format like protobuf.

> The issue is about debuggability. We don't want to have unreadable payloads (e.g. in the rabbitmq interface).

Did we test that the management interface doesn't decompress the data if we give it the correct content-encoding? (if not, maybe we could add that in?)

eckter commented 3 days ago

> Did we test that the management interface doesn't decompress the data if we give it the correct content-encoding? (if not, maybe we could add that in?)

Apparently it doesn't work: it displays a base64 string that has to be decoded and then decompressed, for example:

echo -n 'H4sIAJD952YC/6tWykvMTVWyUlDyys/IU9JRUEpMB3GNDWoBtW1YHhsAAAA=' | base64 --decode > test.gz ; gunzip test.gz
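
For scripted debugging, the same two steps (base64 decode, then gunzip) can be sketched in Python with the standard library. `decode_ui_payload` is a hypothetical helper name, and the sample string is generated on the spot rather than copied from a real queue.

```python
import base64
import gzip


def decode_ui_payload(b64_text: str) -> str:
    """Decode a base64 string as shown by the management interface, then gunzip it."""
    compressed = base64.b64decode(b64_text)
    return gzip.decompress(compressed).decode("utf-8")


# Build a sample the same way a gzip-compressed message would show up in the interface.
sample_b64 = base64.b64encode(gzip.compress(b'{"example": true}')).decode("ascii")
print(decode_ui_payload(sample_b64))
```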

OpenRailAssociation / osrd