SciLifeLab / Sarek

Detect germline or somatic variants from normal or tumour/normal whole-genome or targeted sequencing
https://nf-co.re/sarek
MIT License

Processes waiting forever #420

Closed maxulysse closed 4 years ago

maxulysse commented 7 years ago

It seems that the RunBamQC process never really finished and that the /scratch/ directory vanished at some point, so the process just waited for almost four days...

N E X T F L O W  ~  version 0.22.2
Launching `/castor/project/proj_nobackup/CAW/default/main.nf` [sad_bhabha] - revision: 8747027725
CANCER ANALYSIS WORKFLOW ~ 1.1 - 8747027725
Command Line: nextflow run /castor/project/proj_nobackup/CAW/default//main.nf -c /castor/project/proj_nobackup/CAW/default//nextflow.config --project sens2016004 --genome GRCh38 --step recalibrate --sample Preprocessing/NonRecalibrated/nonRecalibrated.tsv
Project Dir : /castor/project/proj_nobackup/CAW/default
Launch Dir  : /castor/project/proj_nobackup/XXX
Work Dir    : /scratch
TSV file    : /castor/project/proj_nobackup/XXX/Preprocessing/NonRecalibrated/nonRecalibrated.tsv
Genome      : GRCh38
Step        : recalibrate
[warm up] executor > local
[0a/28a557] Submitted process > RunSamtoolsStats (TUMOR)
[27/8db34d] Submitted process > RecalibrateBam (TUMOR)
[be/e4149e] Submitted process > RunSamtoolsStats (NORMAL)
[76/1580a1] Submitted process > RecalibrateBam (NORMAL)
[da/57e6db] Submitted process > RunBamQC (TUMOR)

trace.txt

process task_id hash    name    attempt status  exit    realtime        %cpu    vmem    rss     submit  start   complete        duration        realtime        rchar   wchar
RunSamtoolsStats        4       0a/28a557       RunSamtoolsStats (TUMOR)        1       COMPLETED       0       5m 23s  106.0%  133 MB  5.7 MB  2017-08-14 15:20:23.021 2017-08-14 15:20:23.042 2017-08-14 15:25:52.893      5m 30s  5m 23s  6.1 GB  67 B
RecalibrateBam  2       27/8db34d       RecalibrateBam (TUMOR)  1       COMPLETED       0       2h 2m 9s        298.0%  11.9 GB 8.3 GB  2017-08-14 15:25:52.904 2017-08-14 15:25:52.993 2017-08-14 17:28:05.228 2h 2m 12s    2h 2m 9s        6.4 GB  15.7 GB
RunSamtoolsStats        3       be/e4149e       RunSamtoolsStats (NORMAL)        1       COMPLETED       0       4m 33s  96.3%   133 MB  5.7 MB  2017-08-14 17:28:05.240 2017-08-14 17:28:05.402 2017-08-14 17:32:44.722      4m 39s  4m 33s  5.1 GB  67 B
RecalibrateBam  1       76/1580a1       RecalibrateBam (NORMAL)  1       COMPLETED       0       1h 43m 29s      333.0%  11.9 GB 8.2 GB  2017-08-14 17:32:44.732 2017-08-14 17:32:44.830 2017-08-14 19:16:15.605 1h 43m 31s   1h 43m 29s      5.5 GB  13.3 GB
Aug-14 19:20:24.881 [Thread-1] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- first: TaskHandler[id: 6; name: RunBamQC (TUMOR); status: RUNNING; exit: -; workDir: /scratch/da/57e6dbe536acb99ab9eb8b9d3fe202]
...
Aug-18 13:36:08.957 [Thread-1] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- first: TaskHandler[id: 6; name: RunBamQC (TUMOR); status: RUNNING; exit: -; workDir: /scratch/da/57e6dbe536acb99ab9eb8b9d3fe202]
$ ls /scratch/da/57e6dbe536acb99ab9eb8b9d3fe202]
ls: cannot access /scratch/da/57e6dbe536acb99ab9eb8b9d3fe202]: No such file or directory

I don't know if it's an NF or an UPPMAX issue, so I'll try to track it down.

marcelm commented 7 years ago

I also see hanging processes on bianca. Here are some things I know:

marcelm commented 7 years ago

When Nextflow wants to start a process, three scripts are involved: .command.run, .command.run.1 and .command.sh (all three are visible in the process trees below).

Here is how this looks when it is working:

22114 pts/0    S+     0:00       \_ /bin/bash -ue .command.run
22148 pts/0    S+     0:00           \_ tee .command.out
22149 pts/0    S+     0:00           \_ tee .command.err
22150 pts/0    S+     0:00           \_ /bin/bash /scratch/45553/92/.../.command.run.1
22159 pts/0    S+     0:00               \_ /bin/bash -euo pipefail /scratch/45553/92/.../.command.sh
22168 pts/0    Sl+    0:08               |   \_ java -Xmx8g -jar /sw/apps/bioinfo/GATK/3.7/GenomeAnalysisTK.jar -T GenotypeGVCFs ...
22160 pts/0    S+     0:00               \_ /bin/bash /scratch/45553/92/.../.command.run.1

And this is what it looks like when it is hanging:

23950 ?        Sl    18:02      \_ /sw/comp/java/x86_64/sun_jdk1.7.0_25/bin/java ... (nextflow stuff)
 8424 ?        S      0:00          \_ /bin/bash -c /bin/bash -ue .command.run &> /dev/null
 8429 ?        S      0:00              \_ /bin/bash -ue .command.run
 9242 ?        S      0:00                  \_ tee .command.out
 9243 ?        S      0:00                  \_ tee .command.err
...
 9282 ?        S      0:00 /bin/bash /scratch/45553/92/.../.command.run.1
13886 ?        Z      0:00  \_ [bash] <defunct>

What we can see is: the actual job script has finished (only a defunct bash zombie remains), but its parent .command.run.1 is still running and never reaps it, so .command.run never exits and Nextflow keeps waiting.

marcelm commented 6 years ago

I’ve just found a workaround for the hanging process problem that avoids re-running Nextflow. The solution is to kill the parent of the zombie. In the example, these were the relevant processes (the zombie and its parent):

 9282 ?        S      0:00 /bin/bash /scratch/45553/92/.../.command.run.1
13886 ?        Z      0:00  \_ [bash] <defunct>

Here, I can run a kill -9 9282 to get rid of the parent of the zombie. The zombie itself will become a child of init and then reaped after a while, so it also disappears. A normal kill (without -9) does not work for some reason.

Now, the really nice thing is: Nextflow just resumes normal processing after this! So if this was the last GenotypeGVCF process, for example, it would then run ConcatVCF afterwards. So no work is lost, no need to re-run everything!
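For reference, a rough sketch of how to find the right PIDs on the node (the PID values are the ones from the example above; the grep pattern is just an illustration):

# sketch: locate the hung monitor script and its zombie child, then kill the parent
ps -eo pid,ppid,stat,cmd | grep '[.]command\.run\.1'   # find the .command.run.1 wrapper processes
ps --ppid 9282 -o pid,stat,cmd                         # should list the "[bash] <defunct>" zombie
kill -9 9282                                           # SIGKILL the parent; a plain kill did not work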

maxulysse commented 6 years ago

Impressive. Good investigative work. I think it would be good to communicate that to UPPMAX too, and maybe to Nextflow devs.

maxulysse commented 6 years ago

The UPPMAX people warned:

If the job expires (and there's nothing else running on the node) we will try to clean out /scratch. If bash is hung on something we may well remove the file on /scratch (from view, as long as bash has it open it won't go away, really).

So the idea is to warn them next time it happens and report which node it's happening on.

If you want to capture some info in case we don't get there in time, it's probably the stack of the stuck process that is most interesting. dmesg -T will be interesting as well, but I hope we have logs of that anyway.
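A sketch of what such a capture could look like on the node, assuming gdb (or the pstack wrapper it ships with) is available there:

pstack <PID> > bash_stacktrace.txt                      # userspace stack of the stuck bash
gdb -p <PID> -batch -ex 'bt' -ex 'detach' > gdb_bt.txt  # alternative if pstack is missing
sudo cat /proc/<PID>/stack > kernel_stack.txt           # kernel-side stack (needs root)
dmesg -T > dmesg.txt                                    # kernel log with readable timestamps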

maxulysse commented 6 years ago

Reply from UPPMAX people after looking into one hung process

From what I've found so far, the script that's hung is (not surprisingly) a bash script Nextflow sets up to monitor a job. It seems the actual work finishes fine but the monitoring script hangs; the stack ends with a futex. If we attach with gdb and take a stack trace, we see a lot of general bash things (execute_command), but also that it seems to catch a signal while trying to allocate memory and then re-enters and tries to allocate memory again (and deadlocks). I'd say it's probably a bash bug, but it's possible it can be worked around by playing with options from mallopt(3). Providing templates to Nextflow instead of the command script it creates might work as well (it does some monitoring and logs some usage counters, but it also installs a trap EXIT that I think interferes). There seem to be quite a few reports about hangs, but unfortunately most are bad, so I really can't say if anyone else runs into this.

pditommaso commented 6 years ago

Am I understanding correctly that the task execution directory does not exist?

marcelm commented 6 years ago

It does exist.

pditommaso commented 6 years ago

OK, anything useful in the .command.err/out/log files in that directory? (You may want to attach them here.)

Also does the .exitcode file exist?

marcelm commented 6 years ago

Nothing useful in the logs. The .exitcode file does not exist. The .command.sh has actually finished running and all the expected output files are there.

marcelm commented 6 years ago

As I understand it, the .exitcode file is written when the .command.run script exits. But that is actually still running and waiting (for something).
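So from inside the task work dir, the hang looks roughly like this (a sketch, path elided as above):

cd /scratch/45553/92/...          # the task work dir
ls .exitcode                      # missing -> Nextflow still considers the task running
ls                                # ...even though all expected output files are already there
pgrep -af 'command\.run\.1'       # the wrapper script is still alive, waiting for something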

pditommaso commented 6 years ago

Are you using a docker or singularity container for this job ?

marcelm commented 6 years ago

No, it’s an older version of the pipeline where we were not using containers, yet.

pditommaso commented 6 years ago

OK, my suggestion is to troubleshoot this job in this way: log in interactively on a compute node, change into that task's work dir and define the variable

NXF_DEBUG=2

then running:

bash .command.run

You can even use NXF_DEBUG=3 for more detailed tracing.
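Put together, the suggested debugging session would be something like:

# run the task wrapper by hand with Nextflow's tracing enabled
cd /path/to/work/dir/of/the/hung/task
export NXF_DEBUG=2                # or 3 for more detailed tracing
bash .command.run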

marcelm commented 6 years ago

The problem is that this bug appears only sporadically. I’ve already tried to re-run the job over and over again, trying to trigger it, but I was not able to.

The filesystem on this cluster is sometimes a bit slow, so my guess is that some delay when accessing a file is necessary to trigger it, and when everything is in caches, it cannot be triggered.

pditommaso commented 6 years ago

I see. In that case, an option could be to set a hard max time limit and retry the job execution when it fails.

marcelm commented 6 years ago

Just for completeness, here is a backtrace from the /bin/bash .command.run.1 process (the parent of the zombie).

If I understand correctly (thanks to the colleague whose comment @MaxUlysse posted above), it seems that the script hangs while it handles the exit trap.

(gdb) bt
#0  0x00002abffcebbeec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1  0x00002abffce3905d in _L_lock_14730 () from /usr/lib64/libc.so.6
#2  0x00002abffce36163 in malloc () from /usr/lib64/libc.so.6
#3  0x000000000046a48b in xmalloc ()
#4  0x0000000000458fab in array_create ()
#5  0x0000000000458ff1 in array_copy ()
#6  0x0000000000454a96 in run_exit_trap ()
#7  0x0000000000456b86 in termsig_handler ()
#8  0x0000000000440d9b in waitchld.isra.10 ()
#9  0x0000000000441338 in sigchld_handler ()
#10 <signal handler called>
#11 0x00002abffce33ad5 in _int_malloc () from /usr/lib64/libc.so.6
#12 0x00002abffce3610c in malloc () from /usr/lib64/libc.so.6
#13 0x000000000046a555 in xrealloc ()
#14 0x0000000000449ef7 in command_substitute ()
#15 0x000000000044caad in param_expand ()
#16 0x000000000044fc77 in expand_word_internal ()
#17 0x00000000004511dc in call_expand_word_internal.constprop.16 ()
#18 0x000000000045167a in expand_string_assignment ()
#19 0x000000000044b3be in expand_string_if_necessary ()
#20 0x000000000044b7bf in do_assignment_internal ()
#21 0x0000000000452336 in expand_word_list_internal ()
#22 0x0000000000430b40 in execute_simple_command ()
#23 0x0000000000432263 in execute_command_internal ()
...
marcelm commented 6 years ago

Yes, that sounds like a good option. Is the time limit checked by Nextflow itself? If it were checked in the bash scripts it wouldn’t work ...

pditommaso commented 6 years ago

No, it's delegated to the cluster.

marcelm commented 6 years ago

And with a local executor?

pditommaso commented 6 years ago

it seems that the script hangs while it handles the exit trap

Interesting, in the .command.run.1 there's this trap

trap 'exit ${ret:=$?}' EXIT

I can hardly see a problem with it.

And with a local executor?

NF will kill it.

pditommaso commented 6 years ago

Maybe invoking exit in an EXIT trap causes the problem?

marcelm commented 6 years ago

NF will kill it.

Great, then we can use that.

Maybe invoking exit in an EXIT trap causes the problem?

Hm, but it is working in so many cases ...

Thanks for your help so far! It might well be a bash bug as mentioned above, so I think we’ll use a timeout for now. I’ll investigate further, however, and will definitely let you know if we find out what is actually going on.

maxulysse commented 6 years ago

Check processes that are waiting forever on Bianca

maxulysse commented 6 years ago

We're using slurm by default now, so I think this issue is no longer relevant.

alneberg commented 6 years ago

We're experiencing a very similar issue. We have yet to confirm that it is the same type of zombie bash process, but we also have jobs running seemingly forever, using a constant 100% CPU on one core. In our case it is a FreeBayes process.

[screenshot: resource usage plot (memory and CPU over time) of the FreeBayes job, 2018-09-20 15:08]

My belief is that the actual job finishes (or potentially crashes) at around 500 min, where the decline in memory occurs. The job works perfectly when I give it a complete node instead:

[screenshot: resource usage plot of the same job when run on a complete node, 2018-09-20 15:11]

Notice how much faster it ran even though it still only used 1 core. This is of course partly due to not needing to use swap, but I don't think that could explain all of it, right? We actually have another instance running still, which is at about 5000 minutes at the moment...

Summary of our error:

The command run was:

nextflow run ~/Sarek/somaticVC.nf \
--sample Preprocessing/Recalibrated/recalibrated.tsv \
-profile slurm \
--project ngi2016004 \
--genome_base /sw/data/uppnex/ToolBox/hg38bundle \
--genome GRCh38 \
--tools FreeBayes,HaplotypeCaller,Manta,Mutect2,strelka \
--strelkaBP \
--containerPath ~/Sarek/containers/ \
--tag 2.1.0

I have a hypothesis that I/O overload could be what is triggering these errors: for @marcelm, the overload is caused by a large amount of analyses running at once, which is why it has not been reproducible on a per-process basis for him. For us, however, the overload is due to the use of swap, which is reproducible.

I would be happy to help out with the debugging of this error, but I would need some assistance in how to do so.

Thank you!

maxulysse commented 6 years ago

@alneberg Also, you're on Irma for this problem, right?

alneberg commented 6 years ago

Yes that's correct!

maxulysse commented 6 years ago

Thanks, I'm reopening the issue then, and calling in the big guns.

@szilvajuhos @pditommaso @marcelm @ewels Do you have any pointers on how to help resolve this?

marcelm commented 6 years ago

@alneberg This is very likely a different problem, and I therefore suggest you open a new issue for it. The problem described here occurs after the process that does the work (here it was GATK) has already finished. It always involves a zombie process and 0% CPU.
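A quick way to tell the two situations apart on the compute node (a sketch; <PID> is the .command.run.1 / .command.stub process):

ps -o pid,stat,pcpu,etime,cmd -p <PID>   # the problem in this issue: state S and ~0 %CPU
ps --ppid <PID> -o pid,stat,cmd          # a "[bash] <defunct>" child is the zombie signature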

maxulysse commented 6 years ago

That's a good start, let's go to #643 then

alneberg commented 6 years ago

Ah, thanks @marcelm! The 0% CPU really marks a difference between the two!

alneberg commented 6 years ago

Ok, we have a new case of the zombie process on Irma, so I think it would make sense to open this issue again.

I saw one instance of FreeBayes taking slightly too long, so I checked it and found this:

ps -aux
...
alneberg  5642 26.7  0.0  20032  1504 ?        R    Oct02 450:42 /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub
alneberg  8086  0.0  0.0      0     0 ?        Z    03:54   0:00 [bash] <defunct>
...

That is consistent with the type of error you experienced here before, right, @marcelm?

The staff at Uppmax (@pontus) did this excellent debugging:

Ok, I can see bash being stuck on job internals in 5642.

[root@i170 5642]# pstack 5642 | head
#0  0x0000000000449340 in ?? ()
#1  0x000000000044d534 in make_child ()
#2  0x0000000000457272 in command_substitute ()
#3  0x000000000045cb9a in ?? ()
#4  0x000000000045e21a in ?? ()
#5  0x000000000046178b in ?? ()
#6  0x0000000000438367 in ?? ()
#7  0x000000000043a4fb in execute_command_internal ()

The function it's stuck in seems to be unexported, but it references pidstat_table, bgpids, the_pipeline etc.

The job is still running so we still have the option to dig into this problem a bit deeper.

Do you think you could help out with this debugging @pditommaso? Help would be much appreciated.

pditommaso commented 6 years ago

Can you dump the process tree with ps aufx and report the branch having 5642 as root?
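One way to pull out just that branch (a sketch; assumes psmisc's pstree is available on the node):

pstree -aps 5642         # PID 5642 together with its ancestors and children
ps auxf > ps_forest.txt  # or dump the whole forest and locate the branch by eye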

marcelm commented 6 years ago

@pditommaso Was .command.run.1 renamed to .command.stub in more recent Nextflow versions?

pditommaso commented 6 years ago

yes

marcelm commented 6 years ago

Then this appears to be the same issue indeed.

alneberg commented 6 years ago

Thanks for your quick replies!

@pditommaso, yes it appears to be:

root      5145  0.0  0.0 295956  3836 ?        Sl   Oct02   0:16 slurmstepd: [389359.batch]                                                                                       
alneberg  5174  0.0  0.0 113192  1640 ?        S    Oct02   0:00  \_ /bin/bash /var/spool/slurmd/job389359/slurm_script                                                           
alneberg  5457  0.0  0.0 107956   664 ?        S    Oct02   0:00      \_ tee .command.out                                                                                         
alneberg  5463  0.0  0.0 107956   668 ?        S    Oct02   0:00      \_ tee .command.err
alneberg  5469  0.0  0.0 113192   832 ?        S    Oct02   0:00      \_ /bin/bash /var/spool/slurmd/job389359/slurm_script                                                      
alneberg  5492  0.0  0.0  19704  1632 ?        S    Oct02   0:00          \_ /bin/bash -c cd /scratch/389359/nxf.krigkDwpZV; eval export NXF_OPTS="-Xms1g" export PATH="/lupus/ngi/production/v1.13/sw/sarek/bin:$PATH"; /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub                                                                                                                                                                           
alneberg  5570  0.0  0.0  19912  1824 ?        S    Oct02   0:00              \_ /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub                                                                                                                                 
alneberg  5642 27.2  0.0  20032  1504 ?        R    Oct02 463:32                  \_ /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub                                                                                                                             
alneberg  8086  0.0  0.0      0     0 ?        Z    03:54   0:00                      \_ [bash] <defunct>
pditommaso commented 6 years ago

Could you include the complete branch, including the .command.run (the parent process)?

alneberg commented 6 years ago

Of course! I just edited the answer above to include the full branch. Was that what you meant?

marcelm commented 6 years ago

I wonder why CPU usage and time are so high (they weren’t when I observed the issue).

alneberg commented 6 years ago

I can't seem to find 'command.run' within the output of ps aufx.

pditommaso commented 6 years ago

Maybe it does not appear because SLURM renames it to /../job389359/slurm_script. Could you upload the .command.stub here?
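One way to confirm that guess, assuming the slurmd spool file is readable from the node:

diff /var/spool/slurmd/job389359/slurm_script \
     /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.run \
  && echo "identical: slurm_script is the renamed .command.run"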

pditommaso commented 6 years ago

Yes, the CPU and time usage also look unusual.

alneberg commented 6 years ago

Yes, I think that's quite likely. Here is .command.stub:

#!/bin/bash
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 2 ]] && set -x

set -o pipefail
prev_total=0
declare -a prev_time
mem_tot=$(< /proc/meminfo grep MemTotal | awk '{print $2}')
num_cpus=$(< /proc/cpuinfo grep '^processor' -c)

nxf_pcpu() {
    local pid=$1
    local proc_time=$(2> /dev/null < /proc/$pid/stat awk '{sum=$14+$15+$16+$17; printf "%.0f",sum}' || echo -n 0)
    local cpu_usage=$(echo -n $proc_time ${prev_time[pid]:-0} $total_time $prev_total $num_cpus | awk '{ pct=($1-$2)/($3-$4)*$5 *100; printf "%.1f", pct }' )
    prev_time[pid]=$proc_time
    nxf_pcpu_ret=$cpu_usage
}

nxf_tree() {
    declare -a ALL_CHILD
    while read P PP;do
        ALL_CHILD[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    stat() {
        nxf_pcpu $1
        local x_pid=$1
        local x_stat=$(2> /dev/null < /proc/$1/stat awk '{print $3}' || echo -n X)
        local x_pcpu=$nxf_pcpu_ret

        local x_vsz=$(2> /dev/null < /proc/$1/stat awk '{printf "%.0f", $23/1024}' || echo -n 0)
        local x_rss=$(2> /dev/null < /proc/$1/status grep VmRSS | awk '{print $2}' || echo -n 0)
        local x_pmem=$(echo $x_rss | awk -v mem_tot=$mem_tot '{printf "%.1f", $1/mem_tot*100}')

        local x_io=$(2> /dev/null < /proc/$1/io sed 's/^.*:\s*//' | tr '\n' ' ' || echo -n 0)
        local x_vm=$(2> /dev/null < /proc/$1/status egrep 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n 0)

        stat_ret+="$x_pid $x_stat $x_pcpu $x_pmem $x_vsz $x_rss"
        if [[ $x_vm ]]; then stat_ret+=" $x_vm"; else stat_ret+=" 0 0"; fi
        if [[ $x_io ]]; then stat_ret+=" $x_io"; fi
        stat_ret+='\n'
    }

    walk() {
        stat $1
        for i in ${ALL_CHILD[$1]:=}; do walk $i; done
    }
    stat_ret=''
    total_time=$(grep '^cpu ' /proc/stat |awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9+$10; printf "%.0f",sum}')
    walk $1
    prev_total=$total_time
    nxf_tree_ret=$stat_ret
}

nxf_pstat() {
    nxf_tree $1
    if [[ "$nxf_tree_ret" ]]; then
      nxf_pstat_ret=$(printf "$nxf_tree_ret" | awk '{ t3+=($3*10); t4+=($4*10); t5+=$5; t6+=$6; t7+=$7; t8+=$8; t9+=$9; t10+=$10; t11+=$11; t12+=$12; t13+=$13; t14+=$14 } END { printf "%d 0 %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f\n", NR,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 }')
    else
      nxf_pstat_ret=''
    fi
}

nxf_sleep() {
  if [[ $1 < 0 ]]; then sleep 5;
  elif [[ $1 < 10 ]]; then sleep 0.1 2>/dev/null || sleep 1;
  elif [[ $1 < 130 ]]; then sleep 1;
  else sleep 5; fi
}

nxf_date() {
    local ts=$(date +%s%3N); [[ $ts == *3N ]] && date +%s000 || echo $ts
}

nxf_trace() {
  local pid=$1; local trg=$2;
  local count=0;
  declare -a max=(); for i in {0..13}; do max[i]=0; done
  while [[ -d /proc/$pid ]]; do
    nxf_pstat $pid
    [[ ! "$nxf_pstat_ret" ]] && break
    IFS=' ' read -a val <<< "$nxf_pstat_ret"; unset IFS
    for i in {0..13}; do
      [ ${val[i]} -gt ${max[i]} ] && max[i]=${val[i]}
    done
    echo "pid state %cpu %mem vmem rss peak_vmem peak_rss rchar wchar syscr syscw read_bytes write_bytes" > $trg
    echo "${max[@]}" >> $trg
    nxf_sleep $count
    count=$((count+1))
  done
}

trap 'exit ${ret:=$?}' EXIT
touch .command.trace
start_millis=$(nxf_date)
(
/bin/bash -euo pipefail /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.sh
) &
pid=$!
nxf_trace "$pid" .command.trace &
mon=$!
wait $pid || ret=$?
end_millis=$(nxf_date)
wait $mon
echo $((end_millis-start_millis)) >> .command.trace
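For orientation, the tail of this stub is where the trap and waits seen in the earlier backtraces live; annotated copy below (comments added for clarity, logic unchanged, long path elided):

trap 'exit ${ret:=$?}' EXIT                 # re-raise the captured exit code when the stub exits
touch .command.trace
start_millis=$(nxf_date)
(
/bin/bash -euo pipefail .../.command.sh     # the real job, run in a backgrounded subshell
) &
pid=$!
nxf_trace "$pid" .command.trace &           # resource-monitoring loop, also in the background
mon=$!
wait $pid || ret=$?                         # SIGCHLD from the finished job arrives around here
end_millis=$(nxf_date)
wait $mon                                   # then wait for the monitor loop to stop
echo $((end_millis-start_millis)) >> .command.trace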
maxulysse commented 6 years ago

Do you have memory and cpu usage, like you did here?

pontus commented 6 years ago

The bash process is spinning, using one full core.

stacktrace: bash_stacktrace.txt

disassembly (the function it's spinning in seems not to be exported): bash_disassemble.txt

ewels commented 6 years ago

Do you have memory and cpu usage, like you did here?

@MaxUlysse - I think those statistics are only available after a job finishes.

maxulysse commented 6 years ago

That would make sense...