Requests running in timeout at really large jobs

ClusterCockpit / cc-backend

Web frontend and API backend server for ClusterCockpit Monitoring Framework

https://www.clustercockpit.org

MIT License


Requests running in timeout at really large jobs #228

Open oscarminus opened 7 months ago

oscarminus commented 7 months ago

We recently discovered a problem with a very large job: it has a runtime of more than 31 h on 24 nodes with 96 GPUs.

We narrowed this down to the retrieval of metrics from the metric store, where timeouts occur in two places: in the HTTP client in cc-backend (default 10 s) and in the cc-metric-store server (30 s). This results in incomplete data records.

The trivial solution would certainly be to increase the default timeouts. Alternatively, data compression could be considered to reduce the transmission time.
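A minimal sketch of both options in Go, offered as an illustration and not as cc-backend's or cc-metric-store's actual code. It assumes a plain `net/http` client and server. The names `gzipHandler`, `fetch`, `demo`, and `payloadSize` are hypothetical, and the slow handler is a stand-in for a metric store answering a large query. The client timeout is passed in as a parameter instead of being fixed, and the server compresses the response when the client advertises gzip support, which Go's default transport does and then decompresses transparently.

```go
package main

import (
	"compress/gzip"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// payloadSize is an arbitrary stand-in for a large metric response.
const payloadSize = 1 << 20

// gzipResponseWriter routes body writes through a gzip writer.
type gzipResponseWriter struct {
	http.ResponseWriter
	zw *gzip.Writer
}

func (w gzipResponseWriter) Write(b []byte) (int, error) { return w.zw.Write(b) }

// gzipHandler (hypothetical) compresses responses for clients that send
// "Accept-Encoding: gzip"; other clients get the uncompressed response.
func gzipHandler(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r)
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		w.Header().Del("Content-Length")
		zw := gzip.NewWriter(w)
		defer zw.Close() // writes the gzip footer
		next.ServeHTTP(gzipResponseWriter{ResponseWriter: w, zw: zw}, r)
	})
}

// fetch performs a GET and returns the (already decompressed) body.
func fetch(client *http.Client, url string) ([]byte, error) {
	resp, err := client.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// demo queries a deliberately slow test server twice: once with a short
// client timeout and once with a long one (both in milliseconds).
func demo(shortMs, longMs int) (timedOut bool, n int, err error) {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(200 * time.Millisecond) // simulated query time
		w.Header().Set("Content-Type", "text/plain")
		io.WriteString(w, strings.Repeat("0", payloadSize))
	})
	srv := httptest.NewServer(gzipHandler(slow))
	defer srv.Close()

	short := &http.Client{Timeout: time.Duration(shortMs) * time.Millisecond}
	_, shortErr := fetch(short, srv.URL)
	var ne net.Error
	timedOut = errors.As(shortErr, &ne) && ne.Timeout()

	long := &http.Client{Timeout: time.Duration(longMs) * time.Millisecond}
	body, err := fetch(long, srv.URL)
	return timedOut, len(body), err
}

func main() {
	timedOut, n, err := demo(50, 2000)
	fmt.Println("short timeout hit:", timedOut, "| bytes with long timeout:", n, "| err:", err)
}
```

Note that `http.Client.Timeout` covers the whole exchange, including reading the body, so a larger value only postpones the failure if response size keeps growing with job size. Compression addresses transfer time but not the time the server needs to assemble the data.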

aw32 commented 7 months ago

Similar errors also occur when loading multiple job pages in parallel (e.g. opening a list of "concurrent jobs" in new tabs). They all run into 502 errors, but when the pages are reloaded sequentially they work. It seems the parallel load on the server slows down the individual requests. The effect therefore occurs not only with a single very large job, but also with multiple jobs loaded in parallel. The difference is that the pages for "normal" jobs can be reloaded successfully one after another, while the page for the large job always fails.

moebiusband73 commented 7 months ago

Do you run cc-metric-store on the same host as cc-backend? Does the job show up correctly once it has completed and is loaded from the job archive?

oscarminus commented 7 months ago

Yes, the metric collector and the backend are running on the same machine. I will check the second question when the job has finished. It is currently still running.