Closed by oscarminus 3 weeks ago
Similar errors also occur when loading multiple job pages in parallel (e.g. opening a list of "concurrent jobs" in new tabs): they all run into 502 errors. When the pages are reloaded sequentially, they work. It seems the parallel load on the server slows down the individual requests, so the effect occurs not only with single very large jobs, but also with multiple jobs loaded in parallel. The difference is that the pages for "normal" jobs can be successfully reloaded sequentially, while the page for the large job always fails.
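For reference, a minimal Go sketch of the kind of parallel load that triggers this; the host and job IDs are placeholders, not the actual setup:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Placeholder job IDs; substitute IDs of real "concurrent jobs".
	jobIDs := []string{"1001", "1002", "1003", "1004"}

	var wg sync.WaitGroup
	for _, id := range jobIDs {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			// Placeholder host; fetched sequentially these return 200,
			// fetched in parallel some come back as 502.
			resp, err := http.Get("https://cc.example.org/monitoring/job/" + id)
			if err != nil {
				fmt.Println(id, err)
				return
			}
			resp.Body.Close()
			fmt.Println(id, resp.Status)
		}(id)
	}
	wg.Wait()
}
```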
Do you run cc-metric-store on the same host as cc-backend? Does the job show up correctly once it has completed and is loaded from the job archive?
Yes, the metric collector and the backend are running on the same machine. I will check the second question when the job has finished; it is currently still running.
Added to the current draft merge request, since it includes optimizations for data loading (resampling / SQLite index rework).
We recently discovered a problem with a very large job: a job with a runtime of >31 h on 24 nodes and 96 GPUs.
We narrowed this down to the point that, when retrieving the metrics from the metric store, timeouts occur in two places: in the HTTP client in cc-backend (default 10 s) and in the cc-metric-store server (30 s). This results in incomplete data records.
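To illustrate the client side, a minimal sketch (not the actual cc-backend code; the endpoint URL is a placeholder) of why a 10 s `http.Client` timeout produces incomplete records rather than a clean error up front:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// http.Client.Timeout covers the *entire* exchange, including reading
	// the response body, so a large metric payload that takes longer than
	// 10s to stream gets cut off mid-read.
	client := &http.Client{Timeout: 10 * time.Second}

	// Placeholder metric-store endpoint.
	resp, err := client.Get("http://localhost:8082/api/query")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		// For a transfer exceeding the timeout this fails with
		// "context deadline exceeded", leaving an incomplete data record.
		fmt.Println("read aborted:", err)
		return
	}
	fmt.Println("received", len(body), "bytes")
}
```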
The trivial solution would be to increase the default timeouts. Alternatively, compressing the transferred data could be considered to reduce the transmission time.
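A sketch of both options, assuming the standard Go patterns; the names, port, and wiring are illustrative, not the actual cc-backend/cc-metric-store code:

```go
package main

import (
	"compress/gzip"
	"log"
	"net/http"
	"strings"
)

// Option 1: a more generous client timeout for large metric queries
// (raised from the 10s default mentioned above; 120s is an arbitrary example).
// var metricStoreClient = &http.Client{Timeout: 120 * time.Second}

// Option 2: gzip-compress responses on the metric-store side. Go's default
// HTTP client transparently requests and decompresses gzip, so the backend
// would need no changes for this to take effect.
type gzipResponseWriter struct {
	http.ResponseWriter
	gz *gzip.Writer
}

func (w *gzipResponseWriter) Write(b []byte) (int, error) { return w.gz.Write(b) }

func withGzip(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !strings.Contains(r.Header.Get("Accept-Encoding"), "gzip") {
			next.ServeHTTP(w, r) // client can't handle gzip; send plain
			return
		}
		w.Header().Set("Content-Encoding", "gzip")
		gz := gzip.NewWriter(w)
		defer gz.Close()
		next.ServeHTTP(&gzipResponseWriter{ResponseWriter: w, gz: gz}, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/query", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"metrics": "..."}`)) // stand-in for a large metric payload
	})
	log.Fatal(http.ListenAndServe(":8082", withGzip(mux))) // placeholder port
}
```

Compression helps with transmission time, but if the server-side 30 s timeout is dominated by assembling the data rather than sending it, raising the timeouts would still be needed.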