I also see hanging processes on bianca. Here are some things I know:
30408 ? S 0:00 /bin/bash /scratch/45553/ac/60b41b2c711a42d4557f8037545597/.command.run.1
32577 ? Z 0:00 \_ [bash] <defunct>
The problem occurs also with HaplotypeCaller and GenotypeGVCFs. These processes are run quite often (once for each interval), so I assume that every process could be affected with some small probability and that we see it mostly for the processes that run often.
Whenever I checked, the process itself has actually already finished. So for HaplotypeCaller, the full output VCF is there and the log file shows that it has finished.
When Nextflow wants to start a process, three scripts are involved: .command.run, which sets up the work directory (symlinks and environment). The .command.run script then calls .command.run.1, which mostly sets up resource usage monitoring. The .command.run.1 script finally calls the .command.sh script, which contains the actual commands that do the work. For GenotypeGVCFs, for example, it consists of only a single line: java -Xmx8g -jar $GATK_HOME/GenomeAnalysisTK.jar -T GenotypeGVCFs ...
Here is how this looks when it is working:
22114 pts/0 S+ 0:00 \_ /bin/bash -ue .command.run
22148 pts/0 S+ 0:00 \_ tee .command.out
22149 pts/0 S+ 0:00 \_ tee .command.err
22150 pts/0 S+ 0:00 \_ /bin/bash /scratch/45553/92/.../.command.run.1
22159 pts/0 S+ 0:00 \_ /bin/bash -euo pipefail /scratch/45553/92/.../.command.sh
22168 pts/0 Sl+ 0:08 | \_ java -Xmx8g -jar /sw/apps/bioinfo/GATK/3.7/GenomeAnalysisTK.jar -T GenotypeGVCFs ...
22160 pts/0 S+ 0:00 \_ /bin/bash /scratch/45553/92/.../.command.run.1
And this is what it looks like when it is hanging:
23950 ? Sl 18:02 \_ /sw/comp/java/x86_64/sun_jdk1.7.0_25/bin/java ... (nextflow stuff)
8424 ? S 0:00 \_ /bin/bash -c /bin/bash -ue .command.run &> /dev/null
8429 ? S 0:00 \_ /bin/bash -ue .command.run
9242 ? S 0:00 \_ tee .command.out
9243 ? S 0:00 \_ tee .command.err
...
9282 ? S 0:00 /bin/bash /scratch/45553/92/.../.command.run.1
13886 ? Z 0:00 \_ [bash] <defunct>
What we can see is: the .command.sh script has finished; the .command.run.1 script is still running; and a child of .command.run.1 has become a zombie.
I’ve just found a workaround for the hanging-process problem that avoids re-running Nextflow: kill the parent of the zombie. In the example, these were the relevant processes (the zombie and its parent):
9282 ? S 0:00 /bin/bash /scratch/45553/92/.../.command.run.1
13886 ? Z 0:00 \_ [bash] <defunct>
Here, I can run kill -9 9282 to get rid of the parent of the zombie. The zombie itself then becomes a child of init and is reaped after a while, so it also disappears. A normal kill (without -9) does not work for some reason.
Now, the really nice thing is: Nextflow just resumes normal processing after this! So if this was the last GenotypeGVCFs process, for example, it then runs ConcatVCF afterwards. No work is lost, and there is no need to re-run everything!
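For anyone hitting the same symptom, here is the workaround as a small shell sketch (the zombie detection is a hypothetical helper, not part of Nextflow; adjust the filter to your own situation, and double-check the PIDs before killing anything):
# Find defunct bash children and kill their hung parents (run on the affected node).
# Assumes, as above, that each zombie's parent is the stuck .command.run.1 bash process.
for zpid in $(ps -eo pid=,stat=,comm= | awk '$2 ~ /^Z/ && $3 == "bash" {print $1}'); do
    ppid=$(ps -o ppid= -p "$zpid" | tr -d ' ')
    echo "zombie $zpid has hung parent $ppid"
    kill -9 "$ppid"   # a plain kill (SIGTERM) did not work for me; SIGKILL did
done
# After this, init reaps the orphaned zombie and Nextflow resumes on its own.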
Impressive. Good investigative work. I think it would be good to communicate that to UPPMAX too, and maybe to Nextflow devs.
The UPPMAX people warned:
If the job expires (and there's nothing else running on the node) we will try to clean out /scratch. If bash is hung on something, we may well remove the file on /scratch (from view; as long as bash has it open it won't really go away).
So the idea is to warn them next time it happens, and report which node it is happening on.
If you want to capture some info in case we don't get there in time, the stack of the stuck process is probably the most interesting. dmesg -T output will be interesting as well, but I hope we have logs of that anyway.
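Something along these lines would do (a sketch; the PID is a placeholder for the stuck .command.run.1 bash process, and the output file names are only suggestions):
pid=9282                                                      # placeholder: PID of the stuck bash
pstack "$pid" > bash_stacktrace.txt                           # user-space stack (gdb-based)
cat /proc/"$pid"/stack > bash_kernel_stack.txt 2>/dev/null    # kernel-side stack (needs root)
cat /proc/"$pid"/status > bash_proc_status.txt 2>/dev/null    # process state, signal masks, memory
dmesg -T | tail -n 100 > dmesg_tail.txt                       # recent kernel messages, as suggested above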
Reply from the UPPMAX people after looking into one hung process:
From what I've found so far, the script that's hung is (not surprisingly) a bash script that Nextflow sets up to monitor a job. It seems the actual work finishes fine but the monitoring script hangs. The stack ends with a futex. If we attach with gdb and take a stack trace, we see a lot of general bash things (execute_command), but also that it seems to catch a signal while trying to allocate memory, then re-enters and tries to allocate memory again (and deadlocks). I'd say it's probably a bash bug, but it's possible it can be worked around by playing with options from mallopt(3). If one can provide templates to Nextflow instead of the command script it creates, that might work as well (it does some monitoring and logs some usage counters, but it also installs a trap on EXIT that I think interferes). There seem to be quite a few reports about hangs, but unfortunately most are of poor quality, so I really can't say whether anyone else runs into this.
Am I understanding correctly that the task execution directory does not exist?
It does exist.
OK, anything useful in the .command.err/out/log files in that directory? (You may want to attach them here.)
Also, does the .exitcode file exist?
Nothing useful in the logs. The .exitcode file does not exist.
The .command.sh script has actually finished running and all the expected output files are there.
As I understand it, the .exitcode file is written when the .command.run script exits. But that is actually still running and waiting (for something).
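To illustrate the state concretely, this is roughly what an inspection of the task work directory shows (a sketch; the path variable and the output glob are placeholders for this GATK example):
cd "$TASK_WORKDIR"                     # placeholder: the hung task's work directory
ls -l .exitcode 2>/dev/null || echo ".exitcode missing: .command.run has not exited yet"
ls -l *.vcf* 2>/dev/null               # the task's expected outputs are already there
tail .command.log .command.err         # nothing useful in the logs in this case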
Are you using a Docker or Singularity container for this job?
No, it’s an older version of the pipeline where we were not using containers, yet.
OK, my suggestion is to troubleshoot this job like this: log in interactively to a compute node, change into that task work dir, define the variable NXF_DEBUG=2, and then run:
bash .command.run
You can even use NXF_DEBUG=3 for more detailed tracing.
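Spelled out as commands (the work dir path is a placeholder):
# On an interactive compute node:
cd "$TASK_WORKDIR"        # placeholder: the task work directory under work/ (or /scratch)
export NXF_DEBUG=2        # values above 2 (e.g. 3) also turn on 'set -x' inside the stub script
bash .command.run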
The problem is that this bug appears only sporadically. I’ve already tried to re-run the job over and over again, trying to trigger it, but I was not able to.
The filesystem on this cluster is sometimes a bit slow, so my guess is that some delay when accessing a file is necessary to trigger it, and when everything is in caches, it cannot be triggered.
I see, in this case an option could be to set a max time hard limit and retry the job execution when it fails.
Just for completeness, here is a backtrace from the /bin/bash .command.run.1 process (the parent of the zombie).
If I understand correctly (thanks to the colleague whose comment @MaxUlysse posted above), it seems that the script hangs while it handles the exit trap.
(gdb) bt
#0 0x00002abffcebbeec in __lll_lock_wait_private () from /usr/lib64/libc.so.6
#1 0x00002abffce3905d in _L_lock_14730 () from /usr/lib64/libc.so.6
#2 0x00002abffce36163 in malloc () from /usr/lib64/libc.so.6
#3 0x000000000046a48b in xmalloc ()
#4 0x0000000000458fab in array_create ()
#5 0x0000000000458ff1 in array_copy ()
#6 0x0000000000454a96 in run_exit_trap ()
#7 0x0000000000456b86 in termsig_handler ()
#8 0x0000000000440d9b in waitchld.isra.10 ()
#9 0x0000000000441338 in sigchld_handler ()
#10 <signal handler called>
#11 0x00002abffce33ad5 in _int_malloc () from /usr/lib64/libc.so.6
#12 0x00002abffce3610c in malloc () from /usr/lib64/libc.so.6
#13 0x000000000046a555 in xrealloc ()
#14 0x0000000000449ef7 in command_substitute ()
#15 0x000000000044caad in param_expand ()
#16 0x000000000044fc77 in expand_word_internal ()
#17 0x00000000004511dc in call_expand_word_internal.constprop.16 ()
#18 0x000000000045167a in expand_string_assignment ()
#19 0x000000000044b3be in expand_string_if_necessary ()
#20 0x000000000044b7bf in do_assignment_internal ()
#21 0x0000000000452336 in expand_word_list_internal ()
#22 0x0000000000430b40 in execute_simple_command ()
#23 0x0000000000432263 in execute_command_internal ()
...
Yes, that sounds like a good option. Is the time limit checked by Nextflow itself? If it were checked in the bash scripts it wouldn’t work ...
No, it's delegated to the cluster.
And with a local executor?
it seems that the script hangs while it handles the exit trap
Interesting, in the .command.run.1 there's this trap:
trap 'exit ${ret:=$?}' EXIT
I can hardly see a problem with it.
And with a local executor?
NF will kill it.
Maybe invoking exit in an EXIT trap causes the problem?
NF will kill it.
Great, then we can use that.
Maybe invoking exit in an EXIT trap causes the problem?
Hm, but it is working in so many cases ...
Thanks for your help so far! It might well be a bash bug as mentioned above, so I think we’ll use a timeout for now. I’ll investigate further, however, and will definitely let you know if we find out what is actually going on.
Using slurm now by default, so I think this issue is no longer relevant.
We're experiencing a very similar issue. We have yet to confirm that it is the same type of zombie bash process, but we also have jobs running seemingly forever, using a constant 100% CPU on one core. In our case this is a FreeBayes process.
My belief is that the actual job finishes (or potentially crashes) at around 500 min, where the decline in memory occurs. The job works perfectly when I give it a complete node instead:
Notice how much faster it ran even though it still only used 1 core. This is of course partly due to not needing to use swap, but I don't think that could explain all of it, right? We actually have another instance running still, which is at about 5000 minutes at the moment...
Summary of our error:
The command that was run:
run ~/Sarek/somaticVC.nf \
--sample Preprocessing/Recalibrated/recalibrated.tsv \
-profile slurm \
--project ngi2016004 \
--genome_base /sw/data/uppnex/ToolBox/hg38bundle \
--genome GRCh38 \
--tools FreeBayes,HaplotypeCaller,Manta,Mutect2,strelka \
--strelkaBP \
--containerPath ~/Sarek/containers/ \
--tag 2.1.0
I have a hypothesis that I/O overload could be what is triggering these errors: for @marcelm, the overload is caused by a large amount of analysis running at once, and therefore it has not been reproducible on a per-process basis for him. For us, however, the overload is due to the use of swap, which is reproducible.
I would be happy to help out with the debugging of this error, but I would need some assistance in how to do so.
Thank you!
@alneberg Also, you're on Irma for this problem, right?
Yes that's correct!
Thanks, I'm reopening the issue then, and calling in the big guns.
@szilvajuhos @pditommaso @marcelm @ewels Do you have any pointers on how to help resolve this?
@alneberg This is very likely a different problem, and I therefore suggest you open a new issue for it. The problem described here occurs after the process that does the work (here it was GATK) has already finished. It always involves a zombie process and 0% CPU.
That's a good start, let's go to #643 then
Ah, thanks @marcelm! The 0% CPU really marks a difference between the two!
Ok, we have a new case of the zombie process on Irma, so I think it would make sense to open this issue again.
I saw one instance of FreeBayes taking slightly too long, so I checked it and found this:
ps -aux
...
alneberg 5642 26.7 0.0 20032 1504 ? R Oct02 450:42 /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub
alneberg 8086 0.0 0.0 0 0 ? Z 03:54 0:00 [bash] <defunct>
...
That is consistent with the type of error you experienced here before, right, @marcelm?
The staff at Uppmax (@pontus) did this excellent debugging:
Ok, I can see bash being stuck on job internals in 5642.
[root@i170 5642]# pstack 5642 | head
#0 0x0000000000449340 in ?? ()
#1 0x000000000044d534 in make_child ()
#2 0x0000000000457272 in command_substitute ()
#3 0x000000000045cb9a in ?? ()
#4 0x000000000045e21a in ?? ()
#5 0x000000000046178b in ?? ()
#6 0x0000000000438367 in ?? ()
#7 0x000000000043a4fb in execute_command_internal ()
The function it's stuck in seems to be unexported, but it references pidstat_table, bgpids, the_pipeline etc.
The job is still running so we still have the option to dig into this problem a bit deeper.
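Some concrete things that could be captured from the spinning process while it is still alive (a sketch; requires gdb and strace on the node, the output file names are only suggestions):
gdb -p 5642 -batch -ex 'bt full' -ex 'detach' > bash_bt_full.txt 2>&1   # user-space backtrace with locals
timeout 30 strace -f -tt -p 5642 -o strace_5642.txt                     # does it make any syscalls at all?
cat /proc/5642/wchan; echo                                              # kernel wait channel (empty/0 if spinning in user space)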
Do you think you could help out with this debugging @pditommaso? Help would be much appreciated.
Can you dump the process tree with ps aufx and report the branch having 5642 as root?
@pditommaso Was .command.run.1 renamed to .command.stub in more recent Nextflow versions?
yes
Then this appears to be the same issue indeed.
Thanks for your quick replies!
@pditommaso, yes it appears to be:
root 5145 0.0 0.0 295956 3836 ? Sl Oct02 0:16 slurmstepd: [389359.batch]
alneberg 5174 0.0 0.0 113192 1640 ? S Oct02 0:00 \_ /bin/bash /var/spool/slurmd/job389359/slurm_script
alneberg 5457 0.0 0.0 107956 664 ? S Oct02 0:00 \_ tee .command.out
alneberg 5463 0.0 0.0 107956 668 ? S Oct02 0:00 \_ tee .command.err
alneberg 5469 0.0 0.0 113192 832 ? S Oct02 0:00 \_ /bin/bash /var/spool/slurmd/job389359/slurm_script
alneberg 5492 0.0 0.0 19704 1632 ? S Oct02 0:00 \_ /bin/bash -c cd /scratch/389359/nxf.krigkDwpZV; eval export NXF_OPTS="-Xms1g" export PATH="/lupus/ngi/production/v1.13/sw/sarek/bin:$PATH"; /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub
alneberg 5570 0.0 0.0 19912 1824 ? S Oct02 0:00 \_ /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub
alneberg 5642 27.2 0.0 20032 1504 ? R Oct02 463:32 \_ /bin/bash /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.stub
alneberg 8086 0.0 0.0 0 0 ? Z 03:54 0:00 \_ [bash] <defunct>
Could you include the complete branch, including the .command.run (the parent process)?
Of course! I just edited the answer above to include the full branch. Was that what you meant?
I wonder why CPU usage and time are so high (they weren’t when I observed the issue).
I can't seem to find 'command.run' within the output of ps aufx.
Maybe it does not appear because SLURM renames it to /../job389359/slurm_script. Could you upload the .command.stub here?
Yes, the CPU and time usage also look unusual.
Yes, I think that's quite likely. Here is .command.stub:
#!/bin/bash
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 2 ]] && set -x
set -o pipefail
prev_total=0
declare -a prev_time
mem_tot=$(< /proc/meminfo grep MemTotal | awk '{print $2}')
num_cpus=$(< /proc/cpuinfo grep '^processor' -c)
nxf_pcpu() {
local pid=$1
local proc_time=$(2> /dev/null < /proc/$pid/stat awk '{sum=$14+$15+$16+$17; printf "%.0f",sum}' || echo -n 0)
local cpu_usage=$(echo -n $proc_time ${prev_time[pid]:-0} $total_time $prev_total $num_cpus | awk '{ pct=($1-$2)/($3-$4)*$5 *100; printf "%.1f", pct }' )
prev_time[pid]=$proc_time
nxf_pcpu_ret=$cpu_usage
}
nxf_tree() {
declare -a ALL_CHILD
while read P PP;do
ALL_CHILD[$PP]+=" $P"
done < <(ps -e -o pid= -o ppid=)
stat() {
nxf_pcpu $1
local x_pid=$1
local x_stat=$(2> /dev/null < /proc/$1/stat awk '{print $3}' || echo -n X)
local x_pcpu=$nxf_pcpu_ret
local x_vsz=$(2> /dev/null < /proc/$1/stat awk '{printf "%.0f", $23/1024}' || echo -n 0)
local x_rss=$(2> /dev/null < /proc/$1/status grep VmRSS | awk '{print $2}' || echo -n 0)
local x_pmem=$(echo $x_rss | awk -v mem_tot=$mem_tot '{printf "%.1f", $1/mem_tot*100}')
local x_io=$(2> /dev/null < /proc/$1/io sed 's/^.*:\s*//' | tr '\n' ' ' || echo -n 0)
local x_vm=$(2> /dev/null < /proc/$1/status egrep 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n 0)
stat_ret+="$x_pid $x_stat $x_pcpu $x_pmem $x_vsz $x_rss"
if [[ $x_vm ]]; then stat_ret+=" $x_vm"; else stat_ret+=" 0 0"; fi
if [[ $x_io ]]; then stat_ret+=" $x_io"; fi
stat_ret+='\n'
}
walk() {
stat $1
for i in ${ALL_CHILD[$1]:=}; do walk $i; done
}
stat_ret=''
total_time=$(grep '^cpu ' /proc/stat |awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9+$10; printf "%.0f",sum}')
walk $1
prev_total=$total_time
nxf_tree_ret=$stat_ret
}
nxf_pstat() {
nxf_tree $1
if [[ "$nxf_tree_ret" ]]; then
nxf_pstat_ret=$(printf "$nxf_tree_ret" | awk '{ t3+=($3*10); t4+=($4*10); t5+=$5; t6+=$6; t7+=$7; t8+=$8; t9+=$9; t10+=$10; t11+=$11; t12+=$12; t13+=$13; t14+=$14 } END { printf "%d 0 %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f %.0f\n", NR,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13,t14 }')
else
nxf_pstat_ret=''
fi
}
nxf_sleep() {
if [[ $1 < 0 ]]; then sleep 5;
elif [[ $1 < 10 ]]; then sleep 0.1 2>/dev/null || sleep 1;
elif [[ $1 < 130 ]]; then sleep 1;
else sleep 5; fi
}
nxf_date() {
local ts=$(date +%s%3N); [[ $ts == *3N ]] && date +%s000 || echo $ts
}
nxf_trace() {
local pid=$1; local trg=$2;
local count=0;
declare -a max=(); for i in {0..13}; do max[i]=0; done
while [[ -d /proc/$pid ]]; do
nxf_pstat $pid
[[ ! "$nxf_pstat_ret" ]] && break
IFS=' ' read -a val <<< "$nxf_pstat_ret"; unset IFS
for i in {0..13}; do
[ ${val[i]} -gt ${max[i]} ] && max[i]=${val[i]}
done
echo "pid state %cpu %mem vmem rss peak_vmem peak_rss rchar wchar syscr syscw read_bytes write_bytes" > $trg
echo "${max[@]}" >> $trg
nxf_sleep $count
count=$((count+1))
done
}
trap 'exit ${ret:=$?}' EXIT
touch .command.trace
start_millis=$(nxf_date)
(
/bin/bash -euo pipefail /lupus/proj/ngi2016004/private/johannes/sarek_run_dirs/p.bailey_20180924/run_dir_SA/work/86/e71cecefdcd9c6d5fb12afe64cbe26/.command.sh
) &
pid=$!
nxf_trace "$pid" .command.trace &
mon=$!
wait $pid || ret=$?
end_millis=$(nxf_date)
wait $mon
echo $((end_millis-start_millis)) >> .command.trace
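For readers skimming the script: the part relevant to the hang is the tail, where the payload runs in a background subshell, a monitoring loop polls /proc, and an EXIT trap re-raises the captured exit code. A stripped-down, self-contained sketch of just that control flow (placeholder names and a dummy payload; not the generated script itself):
#!/bin/bash
set -euo pipefail
poll_proc_stats() { while [[ -d /proc/$1 ]]; do sleep 1; done; }  # stand-in for nxf_trace's /proc sampling loop
trap 'exit ${ret:=$?}' EXIT      # the same EXIT trap discussed earlier in the thread
( sleep 2 ) &                    # stands in for running .command.sh, the actual work
pid=$!
poll_proc_stats "$pid" &
mon=$!
wait "$pid" || ret=$?            # the SIGCHLD/exit-trap handling around this wait is where bash seems to get stuck
wait "$mon"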
The bash process is spinning, using one full core.
stacktrace: bash_stacktrace.txt
disassemble (the function that it's spinning in seems not to be exported): bash_disassemble.txt
Do you have memory and cpu usage, like you did here?
@MaxUlysse - I think those statistics are only available after a job finishes.
That would make sense...
It seems that the RunBamQC process never really finished and that the /scratch/ directory vanished at some point. So the process just waited for almost four days...
trace.txt
I don't know if it's an NF or an UPPMAX issue, so I'll try to track it down.