OpenRailAssociation / osrd

An open source web application for railway infrastructure design, capacity analysis, timetabling and simulation
https://osrd.fr

Compress payloads in requests across services #8496

Open eckter opened 3 months ago

eckter commented 3 months ago

Description

Services communicate with each other using JSON data in plain text. These payloads are sometimes quite large (> 10 MB), especially for editoast -> core requests: STDCM (50 MB) or infra loading (150 MB).

The HTTP protocol lets us compress data using various compression methods (gzip, ...). Libraries can generally be configured to enable this quite easily.

We should try to use compression and see what happens.
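As a rough illustration of what "compressing a JSON payload over HTTP" means on the wire, here is a minimal sketch using only the Python standard library. The function names `build_compressed_request` and `read_request` are hypothetical and not part of OSRD; in practice the HTTP library would ideally do this for us.

```python
import gzip
import json


def build_compressed_request(payload: dict) -> tuple[bytes, dict[str, str]]:
    """Serialize a payload to JSON, gzip it, and return (body, headers)."""
    raw = json.dumps(payload).encode("utf-8")
    body = gzip.compress(raw)
    headers = {
        "Content-Type": "application/json",
        # Standard HTTP header telling the receiver how the body is encoded.
        "Content-Encoding": "gzip",
    }
    return body, headers


def read_request(body: bytes, headers: dict[str, str]) -> dict:
    """Decode a request body, decompressing it only if the header says so."""
    if headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    return json.loads(body)
```

Repetitive JSON (long lists of similar objects) is the case where gzip is expected to help the most, but the actual gain on our payloads still has to be measured.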

Technical design

  1. Benchmark the STDCM endpoint
    • We must estimate payload sizes
  2. Serialize the message sent by editoast using MessagePack
  3. Deserialize the received message in core
  4. Benchmark the new implementation
    • Check message size and endpoint performance
  5. Repeat steps 2-4 with gzip
  6. Repeat steps 2-4 with MessagePack + gzip
  7. Try to set up the RabbitMQ management interface to decompress messages for readability.
  8. Reduce the RabbitMQ RAM limits in osrd-chart.
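The size and timing comparison in the steps above could be prototyped with a small harness like the one below. This is only a sketch: it measures encoding in isolation, not the real endpoint, and the names `measure`, `CODECS` and `run_benchmark` are made up for illustration. MessagePack is included only if the third-party `msgpack` package happens to be installed.

```python
import gzip
import json
import time


def measure(name, encode, decode, payload, repeat=5):
    """Return the encoded size and average encode/decode time for one codec."""
    start = time.perf_counter()
    for _ in range(repeat):
        encoded = encode(payload)
    encode_s = (time.perf_counter() - start) / repeat

    start = time.perf_counter()
    for _ in range(repeat):
        decoded = decode(encoded)
    decode_s = (time.perf_counter() - start) / repeat

    # Sanity check: the codec must round-trip the payload.
    assert decoded == payload
    return {"name": name, "size": len(encoded), "encode_s": encode_s, "decode_s": decode_s}


CODECS = {
    "json": (lambda p: json.dumps(p).encode("utf-8"), json.loads),
    "json+gzip": (
        lambda p: gzip.compress(json.dumps(p).encode("utf-8")),
        lambda b: json.loads(gzip.decompress(b)),
    ),
}

try:
    import msgpack  # optional third-party dependency

    CODECS["msgpack"] = (msgpack.packb, msgpack.unpackb)
    CODECS["msgpack+gzip"] = (
        lambda p: gzip.compress(msgpack.packb(p)),
        lambda b: msgpack.unpackb(gzip.decompress(b)),
    )
except ImportError:
    pass  # MessagePack rows are simply skipped


def run_benchmark(payload):
    """Run every available codec on the same payload."""
    return [measure(name, enc, dec, payload) for name, (enc, dec) in CODECS.items()]
```

Feeding it a captured STDCM request (loaded with `json.load`) rather than synthetic data would give numbers closer to what production sees.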

Definition of ready

woshilapin commented 2 months ago

A few remarks:

eckter commented 2 months ago

If I understand how this works: compressing the responses is part of the http protocol. It can and should be optional.

curl has a --compressed flag that automatically decompresses the content. By default it doesn't include Accept-Encoding: gzip in the header, so it shouldn't get a compressed response.
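To make the "optional" part concrete, here is a simplified sketch of the server side of that negotiation: the response is only compressed when the client lists gzip in `Accept-Encoding`. The helper names `accepts_gzip` and `respond` are hypothetical, and the header parsing is deliberately minimal (it ignores `*` wildcards), so this is not a full implementation of HTTP content negotiation.

```python
import gzip
from typing import Optional


def accepts_gzip(accept_encoding: Optional[str]) -> bool:
    """Return True if the Accept-Encoding header value allows gzip."""
    if not accept_encoding:
        return False
    for item in accept_encoding.split(","):
        parts = [p.strip() for p in item.split(";")]
        if parts[0].lower() != "gzip":
            continue
        # A quality value of 0 means "not acceptable".
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    return float(param[2:]) > 0
                except ValueError:
                    return False
        return True
    return False


def respond(body: bytes, request_headers: dict) -> tuple:
    """Return (body, response_headers), compressing only if the client asked."""
    normalized = {k.lower(): v for k, v in request_headers.items()}
    if accepts_gzip(normalized.get("accept-encoding")):
        return gzip.compress(body), {"Content-Encoding": "gzip"}
    # No Accept-Encoding: gzip, so the client gets the plain body.
    return body, {}
```

A client that sends no `Accept-Encoding` header, like curl without `--compressed`, falls into the second branch and receives the uncompressed body.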

What I had in mind with this issue is to just toggle some library flags that should make it work transparently and out of the box. This is not about adding a whole new homemade layer on top of all requests. The initial scope was just HTTP requests.


Compressing the content of rabbitmq and redis may or may not be an "out of the box" thing as well, but that seems less likely. I'm not sure how we should handle it if it's not. We could consider doing it manually, but I'm not sure it would be beneficial.

Khoyo commented 2 months ago

RabbitMQ transports arbitrary byte payloads, so no out-of-the-box solution exists, but adding a byte compression step should be simple (and AMQP 0.9.1 does have a content-encoding message property if we want to use it). There may be performance gains (JSON compresses very well), but if we go that route we could (also?) use a binary protocol (self-describing or not, e.g. MessagePack/BSON instead of protobuf).

flomonster commented 2 months ago

The main problem today is the size of messages sent to core. This requires a fairly high RAM limit for RabbitMQ instances (2 GB).

I haven't observed any major problems on the redis side. Maybe the performance can be improved :shrug:

Trying to activate the parameter @khoyo is talking about seems like a good start. However, planning to use a binary protocol like protobuf or bson doesn't seem relevant to me until we've validated the performance problems associated with payload size.

eckter commented 2 months ago

Notes for the 09/16 workshop:

HTTP requests

For HTTP requests, we should check whether libraries can handle compression natively. If they do, great (we should still measure performance). If they don't, we can just drop it. There's likely not much to be gained there (edit: except for the infra loading process).

RabbitMQ

(side-note: large payloads in rabbitmq may be an issue in itself)

(question: does osrdyne read the infra id from the payload? (apparently not))

Adding compression there would fix issues, but native support is limited. There's a content-encoding attribute, but it's apparently not used by RabbitMQ itself: it's up to us to handle it.

The issue is about debuggability. We don't want to have unreadable payloads (e.g. in the rabbitmq interface).

We could add a parameter (e.g. env variable) to add compression when sending something to the queue. When reading, we'd rely on the content encoding.
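A minimal sketch of that idea, without any real AMQP client: the sender decides whether to compress based on an environment variable, and the reader looks only at the content encoding carried with the message. The `Message` class stands in for a RabbitMQ message with its properties, and the variable name `OSRD_COMPRESS_MQ` is invented for this example.

```python
import gzip
import json
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    """Stand-in for an AMQP message: a body plus its content-encoding property."""
    body: bytes
    content_encoding: Optional[str] = None


def encode_message(payload: dict, compress: Optional[bool] = None) -> Message:
    """Serialize a payload, compressing it if enabled by argument or environment."""
    if compress is None:
        # Hypothetical toggle; the real parameter name is still to be decided.
        compress = os.environ.get("OSRD_COMPRESS_MQ", "0") == "1"
    raw = json.dumps(payload).encode("utf-8")
    if compress:
        return Message(body=gzip.compress(raw), content_encoding="gzip")
    return Message(body=raw)


def decode_message(message: Message) -> dict:
    """Decode a message by relying on its content encoding, not on configuration."""
    if message.content_encoding == "gzip":
        raw = gzip.decompress(message.body)
    elif message.content_encoding in (None, "identity"):
        raw = message.body
    else:
        raise ValueError(f"unsupported content encoding: {message.content_encoding}")
    return json.loads(raw)
```

Because the reader never consults the environment variable, compression can be switched off for a debugging session on the sending side only, and compressed and uncompressed messages can coexist in the same queue.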

Redis

It seems to have the same issues and possible solutions as RabbitMQ, but apparently redis works fine as it is with large payloads. We can ignore it for now and focus on RabbitMQ, and then maybe apply the solutions we've found to redis.

RabbitMQ request sizes

So the main issue is probably that we're not supposed to put large payloads in there. Could we find other solutions?

The largest payload in RabbitMQ is the STDCM request input. We could remove the timetable data, and have core fetch it when receiving the request. But that would make debugging more tedious (we can't just reuse requests with no other context).

Khoyo commented 2 months ago

> However, planning to use a binary protocol like protobuf or bson

MessagePack should be superior to BSON for our use case, and way easier to use than a non-self-describing format like protobuf.

> The issue is about debuggability. We don't want to have unreadable payloads (e.g. in the rabbitmq interface).

Did we test that the management interface doesn't decompress the data if we give it the correct content-encoding? (if not, maybe we could add that in?)

eckter commented 2 months ago

> Did we test that the management interface doesn't decompress the data if we give it the correct content-encoding? (if not, maybe we could add that in?)

Apparently it doesn't work: it displays a base64 string that has to be decoded, then decompressed:

echo -n 'H4sIAJD952YC/6tWykvMTVWyUlDyys/IU9JRUEpMB3GNDWoBtW1YHhsAAAA=' | base64 --decode > test.gz ; gunzip test.gz
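For reference, the same two-step decoding (base64, then gzip) can be done in a few lines of Python, which may be handier than the shell when inspecting many messages. The function name `decode_management_payload` is made up for this sketch, and it assumes the payload is UTF-8 text once decompressed.

```python
import base64
import gzip


def decode_management_payload(b64_text: str) -> str:
    """Turn the base64 string shown by the management interface back into text."""
    compressed = base64.b64decode(b64_text)
    # Assumes the body was gzip-compressed and contains UTF-8 text (e.g. JSON).
    return gzip.decompress(compressed).decode("utf-8")
```

Calling it with the base64 string from the command above should give the same content as the `test` file produced by `gunzip`.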