OpenRailAssociation / osrd

An open source web application for railway infrastructure design, capacity analysis, timetabling and simulation
https://osrd.fr
GNU Lesser General Public License v3.0

core: conflict-detection: error 500 when there is a lot of conflict #7865

Closed: bloussou closed this issue 5 days ago

bloussou commented 1 week ago

What happened?

See this scenario: https://rec-osrd.reseau.sncf.fr/operational-studies/projects/4/studies/41/scenarios/53

{"status":500,"type":"editoast:coreclient:CannotExtractResponseBody","context":{"msg":"request or response body error: error reading a body from connection: end of file before message length reached"},"message":"Cannot extract Core response body: request or response body error: error reading a body from connection: end of file before message length reached"}

What did you expect to happen?

I expect to see all the conflicts in the frontend, without an error.

How can we reproduce it (as minimally and precisely as possible)?

  1. create a timetable with thousands of conflicts
  2. try to open the scenario

What operating system, browser and environment are you using?

OSRD version (top right corner Account button > Informations)

d9655c0

eckter commented 1 week ago

It sounds like the request from editoast to core is too large and some lib on either side can't handle it.

If it's actually the origin of the bug, we could easily batch the requests. We just need to not separate the resource uses of a single zone into different requests.
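
To make the constraint concrete, here is a minimal grouping sketch (in Java for illustration; the actual request-building lives in editoast and isn't shown here). Zones may be split across requests, but every resource use of a given zone stays in the same batch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the batching constraint: zones may be split across requests,
// but all resource uses of a given zone must stay in the same batch so
// the conflicts inside that zone can still be detected in one pass.
class ZoneBatcher {
    static <U> List<Map<String, List<U>>> batchByZone(Map<String, List<U>> usesByZone, int zonesPerBatch) {
        List<Map<String, List<U>>> batches = new ArrayList<>();
        Map<String, List<U>> current = new LinkedHashMap<>();
        for (Map.Entry<String, List<U>> zone : usesByZone.entrySet()) {
            current.put(zone.getKey(), zone.getValue()); // a zone is never split
            if (current.size() == zonesPerBatch) {
                batches.add(current);
                current = new LinkedHashMap<>();
            }
        }
        if (!current.isEmpty())
            batches.add(current);
        return batches;
    }
}
```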

flomonster commented 1 week ago

> It sounds like the request from editoast to core is too large and some lib on either side can't handle it.
>
> If it's actually the origin of the bug, we could easily batch the requests. We just need to not separate the resource uses of a single zone into different requests.

It looks like Takes has payload limits that should be increased.

eckter commented 1 week ago

> It looks like Takes has payload limits that should be increased.

I didn't really find anything for that, but I may be wrong; googling "Takes" is painful. I found ways to wrap requests to limit their size, but they aren't used by default. If you have found a way to configure this, I'd love a link.

Batching the requests should be straightforward in any case.

I tried to reproduce the bug but couldn't quite do it naively: the front-end became unresponsive before the error triggered. I'll fill a large timetable with a script instead of using the GUI, but that will have to wait until tomorrow.
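
Something along these lines should do the trick, assuming a known timetable id; the endpoint path and payload below are placeholders, the real ones should be taken from the editoast OpenAPI spec:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical filler script: the endpoint path and payload are placeholders.
// The idea is simply to POST thousands of schedules that all depart at the
// same time on the same path, so every pair of trains produces conflicts.
public class FillTimetable {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String base = "http://localhost:4000"; // assumed local editoast
        for (int i = 0; i < 3000; i++) {
            String body = "{\"train_name\": \"conflict_" + i + "\", \"start_time\": \"2024-01-01T00:00:00Z\"}";
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(base + "/timetable/42/train_schedules")) // placeholder route
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() >= 400)
                throw new IllegalStateException("request " + i + " failed: " + response.body());
        }
    }
}
```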

flomonster commented 1 week ago

> > It looks like Takes has payload limits that should be increased.
>
> I didn't really find anything for that, but I may be wrong; googling "Takes" is painful. I found ways to wrap requests to limit their size, but they aren't used by default. If you have found a way to configure this, I'd love a link.
>
> Batching the requests should be straightforward in any case.
>
> I tried to reproduce the bug but couldn't quite do it naively: the front-end became unresponsive before the error triggered. I'll fill a large timetable with a script instead of using the GUI, but that will have to wait until tomorrow.

I didn't find documentation either. It was a guess that the limit was coming from the HTTP server framework.

eckter commented 1 week ago

I still can't reproduce it locally: everything works fine with 3k overlapping 1,000 km long trains (110k conflicts). It does get far too slow, though, and doesn't seem to scale nicely: with 3k trains it takes 571 s (465 s of which are spent in core).

At some point editoast hits a stack overflow (at around 3,600 trains).

I'm guessing that it's related to the deployment resources. Maybe core gets OOM-killed while building or sending the response. I'll try to play around with that to reproduce it, but if that's the cause of the issue the fix won't be easy.
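
One cheap way to test that hypothesis would be to compare the heap limit the JVM actually resolved with the container's memory limit. A standalone sketch (not OSRD code), assuming a plain JVM:

```java
// Standalone probe: print the heap limit the JVM actually resolved, to
// compare against the container's memory limit. If -Xmx (plus off-heap
// usage) is close to or above that limit, the kernel OOM killer can end
// the process before the JVM ever throws OutOfMemoryError.
public class HeapLimitProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long mib = 1024L * 1024L;
        System.out.printf("max heap:       %d MiB%n", rt.maxMemory() / mib);
        System.out.printf("committed heap: %d MiB%n", rt.totalMemory() / mib);
        System.out.printf("free heap:      %d MiB%n", rt.freeMemory() / mib);
    }
}
```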

eckter commented 1 week ago

After looking at a profiler and using different -Xmx values:

When the memory limit is far away, memory use does peak while sending the response. The memory allocated during the conflict detection doesn't seem to be freed before the response is built, and the peak is quite large.

But when the memory limit is tight, the actual peak in memory use happens during the conflict detection itself, and that path is supposed to throw a "clean" error.

My hypothesis is that the -Xmx value in deployment doesn't precisely reflect the memory that can be used without the process being killed. If that's the case, we should fix that first, but we can also help a little by hinting to the GC that it should do some cleanup once the conflicts are computed.
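
For reference, a minimal sketch of what that GC hint could look like, with detect() and serialize() as placeholders for the real conflict-detection code (an illustration, not the actual core endpoint):

```java
import java.util.List;

// Illustration only: detect() and serialize() stand in for the real
// conflict-detection code. System.gc() is a hint the JVM may ignore, but it
// gives the collector a chance to reclaim the intermediate detection data
// before the (potentially large) response is built.
class ConflictEndpointSketch {
    String handle(List<String> spacingRequirements) {
        List<String> conflicts = detect(spacingRequirements);
        // Everything allocated inside detect() is unreachable at this point;
        // suggest a collection before the response allocation peak.
        System.gc();
        return serialize(conflicts);
    }

    private List<String> detect(List<String> requirements) { return List.of(); }
    private String serialize(List<String> conflicts) { return "[" + String.join(",", conflicts) + "]"; }
}
```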

@bloussou I don't think I can dig much further into this issue. I can open a PR for the GC hint, but I won't know for sure whether it fixes the problem, as it seems related to the deployment environment.

eckter commented 5 days ago

After testing a few hypotheses: core behaves perfectly fine when it uses too much RAM. It throws a clean "out of memory" error, and the -Xmx values are fine.

Checking the logs, core did get killed while sending the response, but its RAM use was perfectly fine. It's not clear why or how it was killed.

eckter commented 5 days ago

Can't reproduce, closing.

We can reopen this if we get new cases of core mysteriously dying at random.