cgat-developers / cgat-flow


Pipelines not detecting termination of jobs #74

Closed IanSudbery closed 5 years ago

IanSudbery commented 5 years ago

I've had several occurrences recently where a pipeline fails to detect that a job has finished. That is, the cluster has no jobs running, but the pipeline is still running, showing no error and submitting no further jobs. Interestingly, in this case the log never shows "Waiting for N jobs to finish".

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
2579096 0.00013 STDIN      mb1ims       r     10/19/2018 09:30:58 all.q@sharc-node041.shef.ac.uk     1        

$ tail pipeline.log 
2018-10-19 09:31:30,591 DEBUG main execution - task: pid=42089, grid-session=<drmaa.session.Session object at 0x2aed31334630>, work_dir=/mnt/fastdata/mb1ims/alevin-benmarck
2018-10-19 09:31:30,596 DEBUG main execution - job-options: -V -notify -N neuron_9k.assign_genes.time -p -10 -P gen2reg -l cpu_arch=intel-e5-2630v3 -l rmem=5G -pe smp 16 -R y
2018-10-19 09:31:30,596 DEBUG main execution - running statement: \
                                              cp umi_tools/neuron_9kAligned.sortedByCoord.out.bam $TMPDIR/inbam.bam &&                   cp references/mm_transcriptome.gtf $TMPDIR/geneset.gtf &&                   /usr/bin/time -v                   featureCounts                      -a $TMPDIR/geneset.gtf                      -o umi_tools/neuron_9k.assign_genes                      -T 16                      -s 1                      -R BAM                      $TMPDIR/inbam.bam                   2> umi_tools/neuron_9k.assign_genes.time  &&                   mv inbam.bam.featureCounts.bam umi_tools/neuron_9k.assign_genes.bam &&                    rm $TMPDIR/inbam.bam $TMPDIR/geneset.gtf
2018-10-19 09:31:30,662 DEBUG main execution - job has been submitted with job_id 2579098
2018-10-19 09:31:30,664 DEBUG main execution - task: pid=42089, grid-session=<drmaa.session.Session object at 0x2aed31334630>, work_dir=/mnt/fastdata/mb1ims/alevin-benmarck
2018-10-19 09:31:30,665 DEBUG main execution - job-options: -V -notify -N neurons_2000.assign_genes.time -p -10 -P gen2reg -l cpu_arch=intel-e5-2630v3 -l rmem=5G -pe smp 16 -R y
2018-10-19 09:31:30,665 DEBUG main execution - running statement: \
                                              cp umi_tools/neurons_2000Aligned.sortedByCoord.out.bam $TMPDIR/inbam.bam &&                   cp references/mm_transcriptome.gtf $TMPDIR/geneset.gtf &&                   /usr/bin/time -v                   featureCounts                      -a $TMPDIR/geneset.gtf                      -o umi_tools/neurons_2000.assign_genes                      -T 16                      -s 1                      -R BAM                      $TMPDIR/inbam.bam                   2> umi_tools/neurons_2000.assign_genes.time  &&                   mv inbam.bam.featureCounts.bam umi_tools/neurons_2000.assign_genes.bam &&                    rm $TMPDIR/inbam.bam $TMPDIR/geneset.gtf
2018-10-19 09:31:30,730 DEBUG main execution - job has been submitted with job_id 2579099

The STDIN job here is the pipeline itself.

Interestingly, looking at the code, the sequence of events should be something like:

  1. Submit job
  2. Hand flow back to any other threads that are waiting to run.
  3. Print "waiting for X jobs to complete"
  4. Wait for job completion.

Thus my guess is that at step 2 some other thread takes control, but then never hands it back. As I don't really understand the cooperative multitasking paradigm here, this might be one for @AndreasHeger.
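
To make the suspected sequence concrete, here is a minimal sketch (not the actual cgat-flow code; the function name is made up and time.sleep(0) only stands in for whatever yield point the real code uses) of steps 1-4 against the drmaa-python API:

    import time
    import drmaa

    def submit_and_wait(session, job_template, running_job_ids, logger):
        # 1. submit the job
        job_id = session.runJob(job_template)
        logger.debug("job has been submitted with job_id %s" % job_id)
        running_job_ids.append(job_id)

        # 2. hand control back to any other tasks waiting to run;
        #    time.sleep(0) is only a placeholder for that hand-off
        time.sleep(0)

        # 3. this is the message that never shows up in the log above
        logger.info("waiting for %i jobs to finish" % len(running_job_ids))

        # 4. poll until every submitted job has left the running set
        while running_job_ids:
            for job_id in list(running_job_ids):
                status = session.jobStatus(job_id)
                if status in (drmaa.JobState.DONE, drmaa.JobState.FAILED):
                    running_job_ids.remove(job_id)
            time.sleep(5)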

AndreasHeger commented 5 years ago

Thanks, this is tricky to debug. As I understand it, there should not be any threads. I think the most likely culprit is here:

                if status in (drmaa.JobState.DONE, drmaa.JobState.FAILED):
                    running_job_ids.remove(job_id)

It might be that the job has ended with a different JobState and is thus never removed from the list of running jobs. I will add more log messages and look at the job states. If you are online, could you check the job state of:

qacct -j 2579099
qacct -j 2579098
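
For reference, these are all the job states drmaa-python can report via session.jobStatus(); the check quoted above only removes a job when it reports one of the last two, so a job that ends up in any other state would stay in running_job_ids forever:

    import drmaa

    ALL_JOB_STATES = (
        drmaa.JobState.UNDETERMINED,
        drmaa.JobState.QUEUED_ACTIVE,
        drmaa.JobState.SYSTEM_ON_HOLD,
        drmaa.JobState.USER_ON_HOLD,
        drmaa.JobState.USER_SYSTEM_ON_HOLD,
        drmaa.JobState.RUNNING,
        drmaa.JobState.SYSTEM_SUSPENDED,
        drmaa.JobState.USER_SUSPENDED,
        drmaa.JobState.DONE,
        drmaa.JobState.FAILED,
    )
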
AndreasHeger commented 5 years ago

For example, it might be wise to also add JobState.UNDETERMINED.
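
A sketch of what the extended check could look like (the same snippet quoted above, just with the extra state):

                if status in (drmaa.JobState.DONE,
                              drmaa.JobState.FAILED,
                              drmaa.JobState.UNDETERMINED):
                    running_job_ids.remove(job_id)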

AndreasHeger commented 5 years ago

hmm, although that would not explain why the log message "waiting for {} jobs" never appears.

IanSudbery commented 5 years ago

Exactly.

I think this might be connected to jobs that have @jobs_limit.
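
For context, a minimal illustration of a ruffus task carrying the @jobs_limit decorator (the task and file names are made up, not from this pipeline):

    from ruffus import transform, suffix, jobs_limit

    # run at most four jobs of this task concurrently
    @jobs_limit(4)
    @transform("*.bam", suffix(".bam"), ".counts")
    def count_reads(infile, outfile):
        # placeholder body standing in for the real cluster submission
        open(outfile, "w").close()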

AndreasHeger commented 5 years ago

Thanks, good idea.

AndreasHeger commented 5 years ago

This has been fixed in https://github.com/cgat-developers/ruffus/issues/99 and will go away once a new ruffus is released and installed.

IanSudbery commented 5 years ago

Thanks Andreas.