QutEcoacoustics / baw-workers

Workers that can process various long-running or intensive tasks.
Apache License 2.0

Processes from large analysis runs not being cleaned up #30

Open atruskie opened 8 years ago

atruskie commented 8 years ago

The workers are not properly killing stuck AnalysisPrograms.exe instances. Stuck instances are sapping system memory.

More needs to be done to ensure processes are killed. Not sure what.

The following is from a machine that is running one analysis worker. Only one mono instance is valid (PID 26106). After running analyses on a machine for a while:

    ubuntu   21518  0.1  0.8 180560 71836 ?        Ssl  Oct30   6:25 resque-1.25.2: Forked 20008 at 1446562242
    ubuntu   20008  0.6  0.8 248840 73464 ?        Sl   14:50   0:00  \_ resque-1.25.2: Processing analysis_production since 1446562241 [BawWorkers::Analysis::Action]
    ubuntu   20012  0.0  0.0   4440   656 ?        S    14:50   0:00      \_ sh -c mono "/mnt/workers/production/01/runs/system_277055_2015_11_03t14_50_42z/programs/AnalysisPrograms/AnalysisPrograms.exe" audio2csv /source:"/home/ubuntu/bioacous
    ubuntu   20014  151  9.7 1887296 795740 ?      Sl   14:50   2:03          \_ mono /mnt/workers/production/01/runs/system_277055_2015_11_03t14_50_42z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/
    ubuntu   20839  0.0  0.0  28328  2536 ?        R    14:52   0:00              \_ /usr/bin/sox -q -V4 /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/original_audio/41/41cab8ab-36ad-4003-b3c7-a2cbb4d82b49_20150710-033140Z.wav /mnt/worke
    ubuntu   21574  0.1  6.9 1713052 566544 ?      Sl   Oct30   7:52 mono /mnt/workers/production/01/runs/system_188329_2015_10_30t16_59_19z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/qcif_storage
    ubuntu   13176  0.0 13.1 2186596 1071852 ?     Sl   Nov01   2:16 mono /mnt/workers/production/01/runs/system_331663_2015_11_01t10_01_46z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/qcif_storage
    ubuntu   14626  0.4 14.1 2390380 1159744 ?     Sl   Nov01  13:26 mono /mnt/workers/production/01/runs/system_331617_2015_11_01t13_34_09z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/qcif_storage
    ubuntu   17201  0.0 10.8 2026992 890804 ?      S    Nov01   0:00 mono /mnt/workers/production/01/runs/system_331563_2015_11_01t16_57_16z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/qcif_storage
    ubuntu   13065  0.0  2.8 1329996 231000 ?      Sl   Nov02   0:33 mono /mnt/workers/production/01/runs/system_329189_2015_11_02t07_31_29z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/qcif_storage
    ubuntu    9217  1.2 15.7 2424792 1290536 ?     Sl   10:07   3:36 mono /mnt/workers/production/01/runs/system_277109_2015_11_03t10_07_46z/programs/AnalysisPrograms/AnalysisPrograms.exe audio2csv /source:/home/ubuntu/bioacoustics/qcif_storage

And the output from top:

    top - 15:06:46 up 3 days, 22:41,  2 users,  load average: 8.89, 10.52, 11.33
    Tasks: 109 total,   3 running, 106 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 39.7 us, 55.1 sy,  0.0 ni,  4.8 id,  0.2 wa,  0.0 hi,  0.2 si,  0.0 st
    KiB Mem:   8176916 total,  7854964 used,   321952 free,     8344 buffers
    KiB Swap:        0 total,        0 used,        0 free.  1383200 cached Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    26106 ubuntu    20   0 2043268 922616  13992 R 164.6 11.3   3:30.25 mono
     9217 ubuntu    20   0 2424792 1.231g      8 S   0.0 15.8   3:36.23 mono
    13065 ubuntu    20   0 1329996 231000      8 S   0.0  2.8   0:33.72 mono
    13176 ubuntu    20   0 2186596 1.022g      8 S   0.0 13.1   2:16.68 mono
    14626 ubuntu    20   0 2390380 1.106g      8 S   0.0 14.2  13:26.82 mono
    21574 ubuntu    20   0 1713052 566544      8 S   0.0  6.9   7:52.13 mono

atruskie commented 8 years ago

I killed off six orphaned processes. Five were timeouts. One was a fatal bug (detailed here: https://github.com/QutBioacoustics/audio-analysis/issues/78 )

    /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/52/52d917ed-31fa-4980-bf50-c6f48a5e4bf2
    /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/73/733bb7e3-f7cb-4b16-ac64-89c7eca27859
    /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/2d/2d385d29-31bc-43dc-b42e-bef46f048e37
    /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/16/1670fd9f-ba9a-47c2-ba9e-20b369453e1c
    /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/61/615a9ccc-8e72-46a8-a54d-07ce8f330c33

    /home/ubuntu/bioacoustics/qcif_storage_nfs/data_prod/analysis_results/system/d9/d9b0778e-adcd-4f29-882c-4cc32839d1ea

cofiem commented 8 years ago

Good detail. I'm also not sure what to do about this. Looks like we need to find a more robust way to detect and kill rogue analysis processes.
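
One possible starting point for the detection side, as a rough sketch only: parse `ps` output and flag AnalysisPrograms.exe processes whose parent is init. The helper name `orphaned_analysis_pids` is hypothetical, the sample paths are made up, and the PPID-of-1 test is an assumption (on hosts with a subreaper, orphans may be re-parented elsewhere). The input is assumed to come from something like `ps -e -o pid= -o ppid= -o args=`.

```ruby
# Hypothetical helper: pick orphaned AnalysisPrograms.exe processes out of
# text in "PID PPID ARGS" form (e.g. from `ps -e -o pid= -o ppid= -o args=`).
# Assumption: an orphan has been re-parented to init, so its PPID is 1.
def orphaned_analysis_pids(ps_output, pattern = /AnalysisPrograms\.exe/)
  ps_output.each_line.map do |line|
    pid, ppid, args = line.strip.split(/\s+/, 3)
    next unless args && args =~ pattern # skip blank lines and other programs
    next unless ppid == '1'             # still owned by a live worker
    Integer(pid)
  end.compact
end

# Illustrative sample, not real output (paths are placeholders).
sample = <<-PS
21574     1 mono /path/to/AnalysisPrograms.exe audio2csv
20014 20012 mono /path/to/AnalysisPrograms.exe audio2csv
20839 20014 /usr/bin/sox -q -V4 input.wav
PS

p orphaned_analysis_pids(sample)
```

A worker (or a cron job) could feed the real `ps` output into this and then decide whether to signal the PIDs it returns. Whether that is safe depends on nothing else on the machine legitimately running AnalysisPrograms.exe under init.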

atruskie commented 8 years ago

Yeah, I think the solution is something like what we do in AP.exe itself to kill off ffmpeg processes.

Just realised this issue might be in the wrong repo! Anyway:

https://github.com/QutBioacoustics/baw-audio-tools/blob/f4ef9f775d590e7e5047c4e5c9181aa57dd1a713/lib/baw-audio-tools/run_external_program.rb#L128

Might need to be turned into a loop. Some dodgy pseudo code:

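    kill_count = 0
    # Signal 0 sends nothing; it only checks that the process still exists.
    # (Grepping ps output would always match the grep process itself.)
    process_alive = lambda do
      begin
        Process.kill(0, pid)
        true
      rescue Errno::ESRCH
        false
      end
    end

    while thread.alive? || process_alive.call
      # We need to kill the process, because killing the thread leaves
      # the process alive but detached, annoyingly enough.
      # Sending TERM (15) instead of KILL (9) to allow clean up rather than
      # dirty exit; KILL is only used on the last attempt.
      if kill_count < max_kills - 1
        Process.kill('TERM', pid)
      else
        Process.kill('KILL', pid)
        # throw worker-level exception, email level error
        # (the message below is only a placeholder)
        fail "Sent KILL to process #{pid} after #{max_kills} attempts"
      end
      kill_count += 1

      #killed = true # not sure what this does

      # Give process time to clean up
      sleep cleanup_sleep
    end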
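
Another angle, again only a sketch: the process tree above shows mono running under a `sh -c` wrapper, so signalling just the wrapper's PID may leave mono (and sox) behind. Assuming that is part of the problem, the command could be started in its own process group and the whole group signalled on timeout. `run_with_group_kill` and its parameters are hypothetical names, not anything in baw-audio-tools.

```ruby
require 'timeout'

# Hypothetical helper: run a command in its own process group and, on
# timeout, signal the whole group so that grandchildren started through
# `sh -c` are stopped too. Returns [exit_status_or_nil, timed_out].
def run_with_group_kill(command, timeout_sec, cleanup_sleep = 1)
  pid = Process.spawn(command, pgroup: true) # child leads a new group
  begin
    Timeout.timeout(timeout_sec) { Process.wait(pid) }
    [$?.exitstatus, false]
  rescue Timeout::Error
    # A signal name with a leading minus targets the process group.
    Process.kill('-TERM', pid)
    sleep cleanup_sleep # give the group time to clean up
    begin
      Process.kill('-KILL', pid) # force down anything that ignored TERM
    rescue Errno::ESRCH
      # the whole group has already gone
    end
    Process.wait(pid) # reap the child so it does not linger as a zombie
    [nil, true]
  end
end

status, timed_out = run_with_group_kill("sh -c 'sleep 30'", 0.5, 0.2)
p [status, timed_out]
```

The trade-off is that every timed-out run always waits the full `cleanup_sleep` before the KILL, and a child that moves itself into a different process group would still escape.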