Open atruskie opened 9 years ago
I killed off six orphaned processes. Five were timeouts. One was a fatal bug (detailed here: https://github.com/QutBioacoustics/audio-analysis/issues/78 )
/home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/52/52d917ed-31fa-4980-bf50-c6f48a5e4bf2
/home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/73/733bb7e3-f7cb-4b16-ac64-89c7eca27859
/home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/2d/2d385d29-31bc-43dc-b42e-bef46f048e37
/home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/16/1670fd9f-ba9a-47c2-ba9e-20b369453e1c
/home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/61/615a9ccc-8e72-46a8-a54d-07ce8f330c33
/home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/d9/d9b0778e-adcd-4f29-882c-4cc32839d1ea
Good detail. I'm also not sure what to do about this. Looks like we need to find a more robust way to detect and kill rogue analysis processes.
Yeah, I think the solution is something like what we do in AP.exe itself to kill off ffmpeg processes...
Just realised this issue might be in the wrong repo! Anyway:
Might need to be turned into a loop. Some dodgy pseudo code:
    kill_count = 0
    # `ps -p` exits non-zero once the process is gone; grepping `ps -aux`
    # would always match the grep command itself.
    while thread.alive? || system("ps -p #{pid} > /dev/null")
      # We need to kill the process, because killing the thread leaves
      # the process alive but detached, annoyingly enough.
      # Sending TERM (15) instead of KILL (9) to allow clean up rather than
      # dirty exit
      if kill_count < max_kills - 1
        Process.kill('TERM', pid)
      else
        Process.kill('KILL', pid)
        # throw worker-level exception, email level error
        fail ...
      end
      kill_count += 1
      #killed = true # not sure what this does
      # Give process time to clean up
      sleep cleanup_sleep
    end
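For anyone who wants to try the idea, here is a minimal runnable sketch of the same TERM-then-KILL escalation. It is not the worker's actual code: `terminate_child` is a hypothetical helper, and `max_kills` and `cleanup_sleep` are just the names from the pseudo code above with made-up defaults. It assumes the stuck process is a direct child of the calling Ruby process, so `Process.waitpid` with `WNOHANG` can be used for the liveness check. That also reaps the child, which a `ps` check would not do.

```ruby
# Hypothetical helper (not from the worker code): escalate from TERM to KILL.
# Assumes `pid` is a direct child of this Ruby process.
# Returns :exited if the child was already gone, :TERM if it stopped after a
# polite request, or :KILL if it had to be killed.
def terminate_child(pid, max_kills: 3, cleanup_sleep: 1.0)
  # waitpid with WNOHANG returns the pid (and reaps the child) if it has
  # exited, or nil if it is still running.
  return :exited if Process.waitpid(pid, Process::WNOHANG)

  (max_kills - 1).times do
    Process.kill('TERM', pid) # polite request, allows clean up
    sleep cleanup_sleep       # give the process time to clean up
    return :TERM if Process.waitpid(pid, Process::WNOHANG)
  end

  Process.kill('KILL', pid)   # last resort, cannot be trapped or ignored
  Process.waitpid(pid)        # block until the child has been reaped
  :KILL
rescue Errno::ESRCH, Errno::ECHILD
  # No such process, or it is not (or no longer) our child: nothing to kill.
  :exited
end

# Usage: spawn a long-running child, then clean it up.
pid = Process.spawn('sleep', '60')
puts terminate_child(pid, max_kills: 3, cleanup_sleep: 0.2)
```

A worker would probably still want to raise a worker-level error when the result is `:KILL`, as the pseudo code suggests, since that means the analysis did not get a chance to clean up.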
The workers are not properly killing stuck AnalysisPrograms.exe instances. Stuck instances are sapping system memory.
More needs to be done to ensure processes are killed. Not sure what.
The following is from a machine that is running one analysis worker. Only one mono instance is valid (PID 26106). After running analyses on a machine for a while:
And the output from top: