PacificBiosciences / pypeFLOW

a simple lightweight workflow engine for data analysis scripting
BSD 3-Clause Clear License

Job completion not being detected #44

Open hans-vg opened 7 years ago

hans-vg commented 7 years ago

I updated FALCONintegrate a few days ago to pull in all of the related submodules, because of the recent pypeFLOW fix to the job_queue variable for SLURM. However, now when a job completes in SLURM, it is still detected as "RUNNING" in the logs.

2016-12-02 23:00:39,848 - pypeflow.simple_pwatcher_bridge - INFO - sleep 10
2016-12-02 23:00:49,859 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 6
2016-12-02 23:00:49,859 - pwatcher.fs_based - DEBUG - query(which='list', jobids=<6>)
2016-12-02 23:00:49,860 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P272970845b783b
2016-12-02 23:00:49,860 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-Pf43bab5f570d27
2016-12-02 23:00:49,860 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P5c2c511e4cdb43
2016-12-02 23:00:49,860 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P3ad1be394d1e22
2016-12-02 23:00:49,860 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P93d2d6890055a6
2016-12-02 23:00:49,860 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-Pfcd2bd1698688a
2016-12-02 23:00:49,860 - pypeflow.simple_pwatcher_bridge - INFO - sleep 10
2016-12-02 23:00:59,871 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 6
2016-12-02 23:00:59,871 - pwatcher.fs_based - DEBUG - query(which='list', jobids=<6>)
2016-12-02 23:00:59,872 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P272970845b783b
2016-12-02 23:00:59,872 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-Pf43bab5f570d27
2016-12-02 23:00:59,872 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P5c2c511e4cdb43
2016-12-02 23:00:59,872 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P3ad1be394d1e22
2016-12-02 23:00:59,873 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P93d2d6890055a6
2016-12-02 23:00:59,873 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-Pfcd2bd1698688a
2016-12-02 23:00:59,873 - pypeflow.simple_pwatcher_bridge - INFO - sleep 10
2016-12-02 23:01:09,883 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 6

However, at this point the subjobs had already finished running. If I go to one of the job directories, assembly/mypwatcher/jobs/P5c2c511e4cdb43, I can see from the stdout and stderr files that the job finished.

...clip..
Comparing raw_reads.51 to raw_reads.44

   Capping mutual k-mer matches over 10000 (effectively -t100)
   Hit count = 480,279,248
   Highwater of 20.27Gb space

===========================================================================
Job Finished

Name                : P5c2c511e4cdb43
User                : hansvgdub
Partition           : bigmemh
Nodes               : bigmem3
Cores               : 8
State               : TIMEOUT
Submit              : 2016-12-01T13:44:26
Start               : 2016-12-01T13:44:27
End                 : 2016-12-01T22:44:46

What would be the best way to further debug this issue?

pb-cdunn commented 7 years ago

State : TIMEOUT

Does that mean the Slurm job timed out? Look in mypwatcher/exits/exit-N and mypwatcher/heartbeats to get some idea what is going on.

If the job is not actually running, the missing heartbeat should be detected eventually. Right now, it waits 120s. If it is called RUNNING after several minutes, there is a bug. Hopefully you can help us debug this as outlined above. If that doesn't help, you can try pwatcher_type = network_based, which is more reliable but requires socket communication between your network hosts.
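
If you do switch watchers, it is a one-line change in the [General] section of your FALCON config (the rest of your existing .cfg stays as-is); a minimal sketch:

[General]
# Replace the filesystem-polling watcher seen in your logs (pwatcher.fs_based)
# with the socket-based one; this requires the hosts to reach each other over the network.
pwatcher_type = network_based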

hans-vg commented 7 years ago

The jobs are actually running on the cluster after the sbatch command. In the all.log file, I can see "Status RUNNING" for each of the jobs.

2016-12-05 21:58:23,288 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 6
2016-12-05 21:58:23,289 - pwatcher.fs_based - DEBUG - query(which='list', jobids=<6>)
2016-12-05 21:58:23,290 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P387cb3afe02e0f
2016-12-05 21:58:23,290 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P272970845b783b
2016-12-05 21:58:23,291 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P19226ea67b72ed
2016-12-05 21:58:23,291 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-Pb375120217a7b9
2016-12-05 21:58:23,291 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-P7e79a73c8a143b
2016-12-05 21:58:23,291 - pwatcher.fs_based - DEBUG - Status RUNNING for heartbeat:heartbeat-Pfcd2bd1698688a
2016-12-05 21:58:23,291 - pypeflow.simple_pwatcher_bridge - INFO - sleep 10

Since these are still running, the exits/exit-N data isn't present yet. However, I would think there would be data in mypwatcher/heartbeats, but it doesn't seem like anything is being written there.

hansvgdub@farm:mypwatcher/heartbeats$ ls -lh
total 0

I was monitoring the all.log file while the jobs were running and noticed that the jobs always showed up as "RUNNING" even though they were completing. I believe this is why, after the initial N jobs finish, the main script keeps running and waiting for them.

pb-cdunn commented 7 years ago

Note that the exit- files are written upon completion of the bash scripts, so something strange is happening on your system. You'll have to debug this from your end. My best guess: you did not give the jobs enough time when they were submitted (according to the sbatch commands, which you could post here), so your job-submission system issued SIGKILL. All sub-processes would be killed immediately, so no exit-N files are written. I do not understand why the heartbeats were never written; possibly these jobs are killed very quickly, before the first heartbeat. You need to track down the cause of the TIMEOUT from the Slurm log.

The system is really not very complicated. You can debug this yourself by re-running the batch submission command from your own command-line and watching what files are written into the mypwatcher directory tree.

If you don't want to debug it, then see previous comment. Or look into how to submit a blocking call to your system, and I can show you how to use pwatcher_type=blocking. Watching the filesystem causes us endless problems.

hans-vg commented 7 years ago

I'll continue to debug this on my end. I am giving the jobs 19 hours to complete, and I know they finish well before that limit. For now, I'll try changing pwatcher_type to network_based, or message you regarding blocking to see how to configure it. I know our filesystem is a bit slow, which could be the cause of this issue.

hans-vg commented 7 years ago

It seems as though network_based won't work with SLURM. Here are my currently running jobs: the main falcon script plus the 6 subjobs that the pipeline queued up.

          10502908   bigmemh Pb375120 hansvgdu  R       3:45      1 bigmem7
          10502903   bigmemh P387cb3a hansvgdu  R       3:48      1 bigmem3
          10502904   bigmemh P2729708 hansvgdu  R       3:48      1 bigmem9
          10502905   bigmemh P425a8e8 hansvgdu  R       3:48      1 bigmem9
          10502906   bigmemh P5c2c511 hansvgdu  R       3:48      1 bigmem7
          10502907   bigmemh P7e79a73 hansvgdu  R       3:48      1 bigmem7
          10502902   bigmemh   falcon hansvgdu  R       3:52      1 bigmem3

There was nothing in the heartbeats folder again, but I looked into the PID folder and only saw one of my running job IDs.

hansvgdub@farm:$ ls -ltrh mypwatcher/heartbeats/
total 0
hansvgdub@farm:$ ls -ltrh mypwatcher/pid/
total 4.0K
-rw-r--r-- 1 hansvgdub hansvgdub 11 Dec  5 14:32 P387cb3afe02e0f

If I look at the all.log, I can see that this is the only job that is set as 'RUNNING'. The rest of the jobs are UNKNOWN.

2016-12-05 22:39:39,197 - pypeflow.simple_pwatcher_bridge - INFO - sleep 9.0
2016-12-05 22:39:48,206 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 6
2016-12-05 22:39:48,207 - pwatcher.network_based - DEBUG - query(which='list', jobids=<6>)
2016-12-05 22:39:48,208 - pwatcher.network_based - DEBUG - Status RUNNING for jobid:P387cb3afe02e0f
2016-12-05 22:39:48,208 - pwatcher.network_based - DEBUG - Status UNKNOWN for jobid:P272970845b783b
2016-12-05 22:39:48,208 - pwatcher.network_based - DEBUG - Status UNKNOWN for jobid:P425a8e8d9a9508
2016-12-05 22:39:48,209 - pwatcher.network_based - DEBUG - Status UNKNOWN for jobid:P5c2c511e4cdb43
2016-12-05 22:39:48,209 - pwatcher.network_based - DEBUG - Status UNKNOWN for jobid:P7e79a73c8a143b
2016-12-05 22:39:48,210 - pwatcher.network_based - DEBUG - Status UNKNOWN for jobid:Pb375120217a7b9
2016-12-05 22:39:48,210 - pypeflow.simple_pwatcher_bridge - INFO - sleep 9.1

The one job that has a PID file also happens to be on the same node as the main falcon script, so it looks like the watcher can't find PIDs for jobs running on other nodes. Would you suggest I try blocking instead?

pb-cdunn commented 7 years ago

Is there a way to make a "blocking" call to Slurm? With qsub we use -sync y. If so, I'll show you how to do it later tonight. Gotta run to a show...

hans-vg commented 7 years ago

Yes, I believe there is a way in SLURM after reading the documentation here: https://slurm.schedmd.com/sbatch.html

The -m block:block option can configure blocking. I found an example in this post: http://www.nersc.gov/users/computational-systems/cori/running-jobs/example-batch-scripts/

We can pick up with this debugging later. Thank you for your help today. Enjoy your show.

pb-cdunn commented 7 years ago
-m block:block

is for grouping sets of jobs, not for blocking on jobs. I think we can try this:

-W, --wait
    Do not exit until the submitted job terminates. The exit code of the sbatch command will be the same as the exit code of the submitted job. If the job terminated due to a signal rather than a normal exit, the exit code will be set to 1. In the case of a job array, the exit code recorded will be the highest value for any task in the job array. 

I'm not sure of your current sbatch command, so I cannot say exactly how to alter it. I use something like this:

[General]
use_tmpdir = true
pwatcher_type = blocking
job_type = string
job_queue =  qsub -S /bin/bash -sync y -V -q myqueue -N ${JOB_ID}
  -o "${STDOUT_FILE}"
  -e "${STDERR_FILE}"
  -pe smp ${NPROC}
  "${CMD}"

In other words, job_queue should include those ${} strings (with odd names consistent with an internal system used at PacBio), which will be substituted when each command-line is formed. The sge_options will be ignored when job_type=string. You could try this:

job_queue = sbatch -p myqueue -J ${JOB_ID} -o ${STDOUT_FILE} -e ${STDERR_FILE} ${CMD}

That can fail, since the script in ${CMD} might not yet exist in the filesystem on the remote host. But if Slurm copies the script in ${CMD}, then it's fine; I don't know. The alternative is --wrap="python -m pypeflow.do_task task.json", so that the Python module (which definitely exists) can wait for the generated JSON file to appear before proceeding. But that isn't possible yet with our public pypeFLOW version.
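
Putting those pieces together, here is an untested sketch of a blocking Slurm configuration, adapted from the qsub block above: --wait is the flag quoted earlier, the bigmemh partition is taken from your squeue output, and mapping ${NPROC} onto -c (cpus-per-task) is an assumption; the same caveat about ${CMD} needing to exist on the remote host still applies.

[General]
use_tmpdir = true
pwatcher_type = blocking
job_type = string
# --wait makes sbatch block until the submitted job terminates,
# which is what the blocking watcher expects.
job_queue = sbatch --wait -p bigmemh -J ${JOB_ID}
  -o "${STDOUT_FILE}"
  -e "${STDERR_FILE}"
  -c ${NPROC}
  "${CMD}"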

For local runs, you can use something like this:

job_queue = /bin/bash -c "${CMD}" > "${STDOUT_FILE}" 2> "${STDERR_FILE}"

The optional use_tmpdir=/scratch avoids other filesystem problems which have plagued us. It runs every job in a tmpdir on the remote machine and then copies the results back into the filesystem. If use_tmpdir=False, it's off; True is usually the same as use_tmpdir=/tmp.
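
For concreteness, the three forms just described would look like this in the [General] section (the /scratch path is only an example):

[General]
# Run each task in a node-local tmpdir under /scratch, then copy results back.
use_tmpdir = /scratch
# 'true' usually behaves like use_tmpdir = /tmp; 'false' turns the feature off.
#use_tmpdir = true
#use_tmpdir = false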

hans-vg commented 7 years ago

I couldn't get blocking to work, so I think I'm going to try to get the filesystem based monitoring to work.

I have been trying to advance this assembly since early November. Perhaps the GitHub code updates changed the way the heartbeats used to work? I'm going to make a new directory, copy over the input.fofn and cfg file, and try to restart the assembly process. At the very least, I'll be able to tell whether the heartbeat monitoring works in a clean environment, rather than trying to continue an existing assembly.

Edit: I am monitoring the newly queued job. I can see that the first few tasks have exit files in the exits/ directory with 0 status codes. However, the finished and running jobs still have no heartbeat files. The pipeline is able to progress by reading the exit files (at least that's how it looks when reading through the log). Would you like to look at the all.log for this fresh run?

2016-12-06 22:10:30,087 - pypeflow.simple_pwatcher_bridge - INFO - sleep 4.4s
2016-12-06 22:10:34,492 - pypeflow.simple_pwatcher_bridge - DEBUG - N in queue: 1 (max_jobs=8)
2016-12-06 22:10:34,492 - pwatcher.fs_based - DEBUG - query(which='list', jobids=<1>)
2016-12-06 22:10:34,495 - pwatcher.fs_based - DEBUG - Unable to remove heartbeat '/group/dubcovskygrp5/projects/rust/falcon_unzip/pst_130_assembly/round2/mypwatcher/heartbeats/heartbeat-Pc101d117060d3d' when sentinal was found in exit-sentinels listdir.
Traceback (most recent call last):
  File "/share/apps/FALCONintegrate/pypeFLOW/pwatcher/fs_based.py", line 553, in get_status
    os.remove(heartbeat_path)
OSError: [Errno 2] No such file or directory: '/group/dubcovskygrp5/projects/rust/falcon_unzip/pst_130_assembly/round2/mypwatcher/heartbeats/heartbeat-Pc101d117060d3d'

2016-12-06 22:10:34,496 - pwatcher.fs_based - DEBUG - Status EXIT 0 for heartbeat:heartbeat-Pc101d117060d3d
2016-12-06 22:10:34,496 - pypeflow.simple_pwatcher_bridge - DEBUG - recently_done: [(Node(0-rawreads), True)]
2016-12-06 22:10:34,496 - pypeflow.simple_pwatcher_bridge - DEBUG - Num done in this iteration: 1
2016-12-06 22:10:34,497 - pypeflow.simple_pwatcher_bridge - INFO - recently_satisfied: set([Node(0-rawreads)])
2016-12-06 22:10:34,497 - pypeflow.simple_pwatcher_bridge - INFO - Num satisfied in this iteration: 1
2016-12-06 22:10:34,497 - pypeflow.simple_pwatcher_bridge - INFO - Num still unsatisfied: 0
2016-12-06 22:10:34,501 - pypeflow.simple_pwatcher_bridge - DEBUG - Created PypeTask('0-rawreads/daligner-scatter', '/group/dubcovskygrp5/projects/rust/falcon_unzip/pst_130_assembly/round2/0-rawreads/daligner-scatter', "{'scatter_fn': '/group/dubcovskygrp5/project
s/rust/falcon_unzip/pst_130_assembly/round2/0-rawreads/daligner-scatter/scattered.json'}", "{'db_build_done': PLF('rdb_build_done', '0-rawreads'),\n 'run_jobs_fn': PLF('run_jobs.sh', '0-rawreads')}")
2016-12-06 22:10:34,502 - pypeflow.simple_pwatcher_bridge - DEBUG - Added PRODUCERS['0-rawreads/daligner-scatter'] = PypeTask('0-rawreads/daligner-scatter', '/group/dubcovskygrp5/projects/rust/falcon_unzip/pst_130_assembly/round2/0-rawreads/daligner-scatter', "{'s
catter_fn': '/group/dubcovskygrp5/projects/rust/falcon_unzip/pst_130_assembly/round2/0-rawreads/daligner-scatter/scattered.json'}", "{'db_build_done': PLF('rdb_build_done', '0-rawreads'),\n 'run_jobs_fn': PLF('run_jobs.sh', '0-rawreads')}")
2016-12-06 22:10:34,503 - pypeflow.simple_pwatcher_bridge - DEBUG - Built PypeTask('0-rawreads/daligner-scatter', '/group/dubcovskygrp5/projects/rust/falcon_unzip/pst_130_assembly/round2/0-rawreads/daligner-scatter', "{'scatter_fn': PLF('scattered.json', '0-rawrea
ds/daligner-scatter')}", "{'db_build_done': PLF('rdb_build_done', '0-rawreads'),\n 'run_jobs_fn': PLF('run_jobs.sh', '0-rawreads')}")
2016-12-06 22:10:34,503 - pypeflow.simple_pwatcher_bridge - DEBUG - New Node(0-rawreads/daligner-scatter) needs set([Node(0-rawreads)])
2016-12-06 22:10:34,504 - pypeflow.simple_pwatcher_bridge - INFO - Num unsatisfied: 1, graph: 3
2016-12-06 22:10:34,504 - pypeflow.simple_pwatcher_bridge - INFO - About to submit: Node(0-rawreads/daligner-scatter)
2016-12-06 22:10:34,504 - pypeflow.simple_pwatcher_bridge - DEBUG - enque nodes:
set([Node(0-rawreads/daligner-scatter)])

pb-cdunn commented 7 years ago

... Hmmm. I think we might have turned off the heartbeats for now, since they caused more problems than they solved. But that raises the question: why are your tasks still listed as RUNNING? I have no idea. You can experiment with this: https://github.com/pb-cdunn/test-pypeflow

hans-vg commented 7 years ago

Thank you, I'll try it out. Should I just clone that repo in my project directory and use your run.py instead?

I've been testing different assemblies with the most recent commit of FALCONintegrate. Now I am sure that the heartbeats aren't written, but the exits are written upon job completion. In my most recent test, 59 exits were written and detected correctly, but at some point during the 1-preads_ovl step the exits stopped being written. Therefore, the pipeline doesn't continue, because it thinks the jobs are still running.

pb-cdunn commented 7 years ago

Clone it anywhere and just run it. You can create more time-consuming tasks and experiment with it. You are having so many problems that I think you need to isolate them with very simple PypeTasks, not with all of Falcon.