Open mgrabovsky opened 3 years ago
The relevant code pertaining to 2. (running tasks) is located in `retrace.py`. It's parsing the output of `ps`, so I can imagine there being some funny interaction with threading, how processes are listed, etc.
Edit: ~I'm wondering if we may be witnessing some race conditions here since multiple workers may be writing to the SQLite database at the same time. Though I hope SQLite should be able to handle that.~
Edit 2: OK, it wasn't a database bug. Here's a fragment of the `ps` output from one of the moments when an unusually high number of running tasks was detected:
PID PPID ELAPSED CMD
1578079 1 1727 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
[...]
1589317 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589318 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589319 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589320 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589321 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589322 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589323 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589324 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589325 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589326 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589327 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589328 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589329 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589330 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589331 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589332 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589333 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589334 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589335 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589336 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589337 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589338 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589339 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589340 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589341 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589342 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589343 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589344 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589345 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589346 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589347 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589348 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589349 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589350 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589351 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589352 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589353 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589354 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589355 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589356 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589357 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589358 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589359 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589360 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589362 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589363 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589365 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589366 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589367 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589368 1578079 0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
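The fragment above suggests that the extra entries are short-lived children of a single worker: they all have PPID 1578079, the same task ID 762329147, and an ELAPSED of 0. Below is a minimal sketch of how counting every matching line over-reports, next to one possible way of counting only top-level workers. It assumes the `PID PPID ELAPSED CMD` column layout shown above. The function names, the regular expression, and the second sample task (PID 1600001, task 123456789) are made up for illustration; this is not the actual code in `retrace.py`.

```python
import re

# Hypothetical sample in the same column layout as the fragment above
# (PID PPID ELAPSED CMD). The last line is an invented second task.
PS_OUTPUT = """\
    PID    PPID ELAPSED CMD
1578079       1    1727 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589317 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1589318 1578079       0 /usr/bin/python3.6 /usr/bin/retrace-server-worker 762329147
1600001       1      42 /usr/bin/python3.6 /usr/bin/retrace-server-worker 123456789
"""

# Assumed pattern: three numeric columns, then a command line ending in
# "retrace-server-worker <task id>".
WORKER_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+.*retrace-server-worker\s+(\d+)\s*$"
)


def parse_workers(ps_output):
    """Return (pid, ppid, task_id) for every retrace-server-worker line."""
    workers = []
    for line in ps_output.splitlines():
        match = WORKER_RE.match(line)
        if match:
            pid, ppid, _elapsed, task_id = match.groups()
            workers.append((int(pid), int(ppid), int(task_id)))
    return workers


def count_naive(ps_output):
    """Count every matching process; forked children inflate this number."""
    return len(parse_workers(ps_output))


def count_top_level(ps_output):
    """Count only workers whose parent is not itself a worker."""
    workers = parse_workers(ps_output)
    worker_pids = {pid for pid, _ppid, _task in workers}
    return sum(1 for _pid, ppid, _task in workers if ppid not in worker_pids)


print(count_naive(PS_OUTPUT))      # 4
print(count_top_level(PS_OUTPUT))  # 2
```

Counting distinct task IDs instead would be another option, under the assumption that each task is handled by exactly one top-level worker.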
In the 48 hours following the deployment of the Prometheus metrics endpoint, at least two bugs have become apparent thanks to the Grafana dashboard:
1. `retrace_tasks_finished{result="fail"}`.
2. The number of running tasks (`retrace_tasks_running`) sporadically jumps up to wild numbers, such as 70, 18 or 39, for a few minutes at a time. The maximum allowed number of running tasks (`MaxParallelTasks`) is 12 on retrace.fp.org, so these numbers make no sense.