SystemsGenetics / EnTAPnf

Functional Annotation of Gene Lists
MIT License

Segmentation Fault on NRP #1

Closed by 4ctrl-alt-del 4 years ago

4ctrl-alt-del commented 5 years ago

When I attempted to run the orange example on NRP, it failed after about 80 minutes with the following error:

WARN: Killing pending tasks (2)
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007f169850cb12, pid=8, tid=0x00007f1681b1eb10
#
# JRE version: OpenJDK Runtime Environment (8.0_212-b04) (build 1.8.0_212-b04)
# Java VM: OpenJDK 64-Bit Server VM (25.212-b04 mixed mode linux-amd64 compressed oops)
# Derivative: IcedTea 3.12.0
# Distribution: Custom build (Sat May  4 17:33:35 UTC 2019)
# Problematic frame:
# C  [ld-musl-x86_64.so.1+0x50b12]  memcpy+0x2c
#
# Core dump written. Default location: /workspace/alucinor/core or core.8
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid8.log
/bin/bash: line 1:     8 Segmentation fault      nextflow run systemsgenetics/AnnoTater -name loving-crick

4ctrl-alt-del commented 5 years ago

The output references a report file that I think is saved somewhere on NRP, but I have no idea how to look for it, let alone retrieve it.

4ctrl-alt-del commented 5 years ago

Attempted to run it again last night on NRP. It failed with a completely different error:

Error executing process > 'nr_index'

Caused by:
  Host is unreachable (Host unreachable)

WARN: Killing pending tasks (2)

bentsherman commented 5 years ago

@4ctrl-alt-del The first error looks like Nextflow itself segfaulted. This has happened to me before, and I think it's usually something you can ignore and just run again. As for the second error, you can zoom in on the nr_index process to see why it is happening.

4ctrl-alt-del commented 5 years ago

How do I "zoom in"?

spficklin commented 5 years ago

@4ctrl-alt-del we can chat on zooming in as that's just debugging with Nextflow.

4ctrl-alt-del commented 5 years ago

@spficklin After several updates since we met last week, AnnoTater still crashes on NRP, but now with different symptoms.

Nextflow Output:

$ nextflow -C nextflow.config kuberun systemsgenetics/AnnoTater -v deepgtex-prp -config k8s 
Pod started: sleepy-stonebraker
N E X T F L O W  ~  version 19.07.0
Launching `systemsgenetics/AnnoTater` [sleepy-stonebraker] - revision: 3a6106836d [master]
NOTE: Your local project version looks outdated - a different revision is available in the remote repository [3fc8a01d8b]

General Information:
--------------------
  Profile(s):         standard
  Container Engine:   null

Input Files:
-----------------
  Transcript (mRNA) file:     /workspace/alucinor/examples/Citrus_sinensis-orange1.1g015632m.g.fasta

Data Files:
-----------------
  InterProScan data:          /workspace/alucinor/dbs/interproscan/interproscan-5.36-75.0/data
  Panther data:               /workspace/alucinor/dbs/panther/panther
  NCBI nr data:               /workspace/alucinor/dbs/nr
  Uniprot SwissProt data:     null

Output Parameters:
------------------
  Output directory:           /workspace/alucinor/output

WARN: The channel `create` method is deprecated -- it will be removed in a future release
WARN: The channel `create` method is deprecated -- it will be removed in a future release
WARN: The channel `create` method is deprecated -- it will be removed in a future release
[1d/66d9bf] Submitted process > uniprot_sprot_index
[57/216ad9] Submitted process > interproscan (1)
[57/216ad9] NOTE: Process `interproscan (1)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[c2/4bf64d] Submitted process > interproscan (2)
[c2/4bf64d] NOTE: Process `interproscan (2)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[07/a2081b] Submitted process > interproscan (3)
[07/a2081b] NOTE: Process `interproscan (3)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[cc/46fac4] Submitted process > interproscan (4)
[cc/46fac4] NOTE: Process `interproscan (4)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[37/030adf] Submitted process > interproscan (5)
[1d/66d9bf] NOTE: Process `uniprot_sprot_index` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[ef/8bb7c4] Submitted process > interproscan (6)
[37/030adf] NOTE: Process `interproscan (5)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[5f/7be243] Submitted process > interproscan (8)
[ef/8bb7c4] NOTE: Process `interproscan (6)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[51/2477cc] Submitted process > interproscan (7)
[51/2477cc] NOTE: Process `interproscan (7)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[51/64ec62] Submitted process > interproscan (9)
[51/64ec62] NOTE: Process `interproscan (9)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[87/5b15f3] Re-submitted process > interproscan (1)
[87/5b15f3] NOTE: Process `interproscan (1)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[36/b3eac3] Re-submitted process > interproscan (2)
[36/b3eac3] NOTE: Process `interproscan (2)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[ba/1fc40b] Re-submitted process > interproscan (3)
[ba/1fc40b] NOTE: Process `interproscan (3)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[b0/eab735] Re-submitted process > interproscan (4)
[b0/eab735] NOTE: Process `interproscan (4)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[ac/c9c649] Re-submitted process > uniprot_sprot_index
[ac/c9c649] NOTE: Process `uniprot_sprot_index` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[94/8ae03b] Re-submitted process > interproscan (5)
[94/8ae03b] NOTE: Process `interproscan (5)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[b3/c6bdca] Re-submitted process > interproscan (6)
[b3/c6bdca] NOTE: Process `interproscan (6)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[9c/6752db] Re-submitted process > interproscan (7)
[9c/6752db] NOTE: Process `interproscan (7)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[93/a6465c] Re-submitted process > interproscan (9)
[93/a6465c] NOTE: Process `interproscan (9)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (2)
[cc/7e335a] Re-submitted process > interproscan (1)
[cc/7e335a] NOTE: Process `interproscan (1)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[bf/b77aac] Re-submitted process > interproscan (2)
[bf/b77aac] NOTE: Process `interproscan (2)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[a2/beeb6b] Re-submitted process > interproscan (3)
[a2/beeb6b] NOTE: Process `interproscan (3)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[65/b8a814] Re-submitted process > interproscan (4)
[5f/7be243] NOTE: Process `interproscan (8)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (1)
[5b/572060] Re-submitted process > uniprot_sprot_index
[65/b8a814] NOTE: Process `interproscan (4)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[7b/404af2] Re-submitted process > interproscan (5)
[5b/572060] NOTE: Process `uniprot_sprot_index` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[2c/93f504] Re-submitted process > interproscan (6)
[2c/93f504] NOTE: Process `interproscan (6)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[c1/45e520] Re-submitted process > interproscan (7)
[c1/45e520] NOTE: Process `interproscan (7)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[e6/556203] Re-submitted process > interproscan (9)
[e6/556203] NOTE: Process `interproscan (9)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[af/60d69b] Re-submitted process > interproscan (1)
[7b/404af2] NOTE: Process `interproscan (5)` terminated for an unknown reason -- Likely it has been terminated by the external system -- Execution is retried (3)
[4d/04d4bb] Re-submitted process > interproscan (2)
Error executing process > 'interproscan (1)'

Caused by:
  Process `interproscan (1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  # Call InterProScan on a single sequence.
  /usr/local/interproscan/interproscan.sh       -f TSV,XML       --goterms       --input Citrus_sinensis-orange1.1g015632m.g.1.fasta       --iprlookup       --pathways       --seqtype n       --cpu 2       --output-dir .       --mode standalone       --applications TIGRFAM,SFLD,SUPERFAMILY,Gene3D,Hamap,Coils,ProSiteProfiles,SMART,CDD,PRINTS,Pfam,MobiDBLite,PIRSF,PANTHER,ProDom
  # Remove the temp directory created by InterProScan
  rm -rf ./temp

Command exit status:
  -

Command output:
  (empty)

Command wrapper:
  failed to open log file "/var/log/pods/deepgtex-prp_nf-af60d69b8398944dafaa6751f7bfd421_b4dc3618-aae1-42a6-9cbb-1111d4b8b591/nf-af60d69b8398944dafaa6751f7bfd421/0.log": open /var/log/pods/deepgtex-prp_nf-af60d69b8398944dafaa6751f7bfd421_b4dc3618-aae1-42a6-9cbb-1111d4b8b591/nf-af60d69b8398944dafaa6751f7bfd421/0.log: no such file or directory

Work dir:
  /workspace/alucinor/work/af/60d69b8398944dafaa6751f7bfd421

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

WARN: Killing pending tasks (1)
[cd/c3d25f] Re-submitted process > interproscan (3)

Listing the pods shows that the Nextflow pods all fail like this (edited to show only the relevant ones):

nf-0078adb7054e494d4126beb5997be320   0/1     ContainerCannotRun   0
nf-07a2081bb14b91da55ea20cdfb83f7ce   0/1     ContainerCannotRun   0
nf-16cf64fed2f31a71c5b6a10074adb37d   0/1     ContainerCannotRun   0
nf-1d66d9bfa25327a55b199d486f390157   0/1     ContainerCannotRun   0
nf-1e3d82e0a449ce29490dd222d74bba7f   0/1     ContainerCannotRun   0
nf-1e3f38e6389bfab8b5769a4d320d00f8   0/1     ContainerCannotRun   0
nf-2c93f504a81c1d580efeaa1272b57323   0/1     ContainerCannotRun   0
nf-350f991eeb346fa80f424230df61aeba   0/1     ContainerCannotRun   0
nf-36b3eac3752b61ab4a014f431c374b61   0/1     ContainerCannotRun   0
nf-37030adf4bb3cf15eabcb17bc386c867   0/1     ContainerCannotRun   0
nf-37c39c333f5f8a0619af8524be89eef7   0/1     ContainerCannotRun   0
nf-3d3ca5e92338652b4a527072b913cb9b   0/1     ContainerCannotRun   0
nf-40a98fa862a3671be782de37c69ceb38   0/1     ContainerCannotRun   0
nf-4fbf7849e74c02f2db6116cd25e5fac6   0/1     ContainerCannotRun   0
nf-512477cc0a4695d3d5157ad772c9b5ee   0/1     ContainerCannotRun   0
nf-5164ec624049083330ae3472e66171bb   0/1     ContainerCannotRun   0
nf-57216ad99d5a2598c2d46ab686f42ed0   0/1     ContainerCannotRun   0
nf-5b572060f014127d19ce649de3f5fccd   0/1     ContainerCannotRun   0
nf-5d9129748cb8da1cff983f40ca105613   0/1     ContainerCannotRun   0
nf-5f7be243d47a2f3d423d308505f7033b   0/1     ContainerCannotRun   0
nf-60f900d6c4d4041765523e21e1900602   0/1     ContainerCannotRun   0
nf-65b8a814cfd1f45cd23de76df03e44a2   0/1     ContainerCannotRun   0
nf-74de832c08ae08cf5b54080a567d5d97   0/1     ContainerCannotRun   0
nf-7a8b4b7ae0069a01b833502cd214ad50   0/1     ContainerCannotRun   0
nf-7b404af203b317917cf042b697eac010   0/1     ContainerCannotRun   0
nf-7bdb137a1555f4569c38fea6e9ffa7f7   0/1     ContainerCannotRun   0
nf-835cafb1458000c2569223baa00fc988   0/1     ContainerCannotRun   0
nf-875b15f357a68b66c2433786d4a9ca06   0/1     ContainerCannotRun   0
nf-90f95c296a72831d0bf3293e27881192   0/1     ContainerCannotRun   0
nf-93a6465c0a758758cb6ee389939657e7   0/1     ContainerCannotRun   0
nf-948ae03b8896ed77c249625b45731225   0/1     ContainerCannotRun   0
nf-9653ded58c2ef48c0030bce88bd32737   0/1     ContainerCannotRun   0
nf-9c6752db58268bb36619ab9711d8a2a4   0/1     ContainerCannotRun   0
nf-a2beeb6b27374abeb8f8dfa773a29c95   0/1     ContainerCannotRun   0
nf-acc9c649c27e80707007f8a4252c6db5   0/1     ContainerCannotRun   0
nf-af60d69b8398944dafaa6751f7bfd421   0/1     ContainerCannotRun   0
nf-b0eab7351dcc264ba82358b25a0cae48   0/1     ContainerCannotRun   0
nf-b3c6bdca7c95b5462994caf0a1678443   0/1     ContainerCannotRun   0
nf-ba1fc40bee86602fdb360e253c65e813   0/1     ContainerCannotRun   0
nf-bfb77aac557543fcc057c13d03939a35   0/1     ContainerCannotRun   0
nf-c145e5205552aa9d3ffb03e1b65a21b4   0/1     ContainerCannotRun   0
nf-c24bf64d574036b12d465f5e257c3539   0/1     ContainerCannotRun   0
nf-c4936b7d5aa3ca588f991b61eadb01cd   0/1     ContainerCannotRun   0
nf-c55f1a09ca91e605b2cb7225b3fec9d6   0/1     ContainerCannotRun   0
nf-cc46fac44d29439581b7c628159f21e9   0/1     ContainerCannotRun   0
nf-cc7e335ac8ab3e7926ee0c5e9f0d7130   0/1     ContainerCannotRun   0
nf-d83a66947a8b5467aea005933f2175dc   0/1     ContainerCannotRun   0
nf-da702cde232c10e27e66c6cf8db8d296   0/1     ContainerCannotRun   0
nf-dd5a33d85229589098906ef25c6f25bf   0/1     ContainerCannotRun   0
nf-e64b4e480d77a0f670cbf1a06f83d5a2   0/1     ContainerCannotRun   0
nf-e65562036991862d43d5755bc0d2a7f6   0/1     ContainerCannotRun   0
nf-eb503542269fa919938d75a76b479f58   0/1     ContainerCannotRun   0
nf-edc41c1ce3186f9b953e129c35a68c95   0/1     ContainerCannotRun   0
nf-edf90d5b07682ff32dcd1ada9546d959   0/1     ContainerCannotRun   0
nf-ef8bb7c4338ec8c8b07f074b77d8ee2e   0/1     ContainerCannotRun   0
nf-f44fdf4f347459801c6213acc98cef40   0/1     ContainerCannotRun   0
nf-f85d06a143a58008ba732034f05b2d14   0/1     ContainerCannotRun   0
nf-faed0088f3308a96d0660ba3cfd82626   0/1     ContainerCannotRun   0

I ran it twice and got the identical crash both times.
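
Comparing the two outputs above, the pod names appear to be built from the task work directories: the failed task's work dir ends in af/60d69b8398944dafaa6751f7bfd421, and the listing contains a pod named nf-af60d69b8398944dafaa6751f7bfd421. A minimal sketch of that mapping follows, assuming the pattern holds for every task. The function name pod_for_workdir is hypothetical and not part of Nextflow or kubectl.

```shell
#!/bin/bash
# Hypothetical helper: derive a pod name from a Nextflow task work directory,
# assuming pods are named "nf-" followed by the last two path components
# of the work dir joined together (as observed in the output above).
pod_for_workdir() {
    local dir=${1%/}          # drop a trailing slash, if any
    local tail=${dir##*/}     # last component, e.g. 60d69b83...
    local parent=${dir%/*}    # everything before the last component
    local head=${parent##*/}  # second-to-last component, e.g. af
    printf 'nf-%s%s\n' "$head" "$tail"
}

pod_for_workdir /workspace/alucinor/work/af/60d69b8398944dafaa6751f7bfd421
```

With this, the work dir that Nextflow prints in its error report can be turned directly into the pod name to inspect.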

bentsherman commented 5 years ago

You can debug further by inspecting the output of an individual pod:

kubectl logs <pod-name>

And also by inspecting the work directory of the process that failed (at least the one that nextflow prints when it terminates):

cd  /workspace/alucinor/work/af/60d69b8398944dafaa6751f7bfd421
ls -al
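
As a rough sketch of what to look for in that directory: the task wrapper pasted later in this thread touches .command.begin when it starts and writes .exitcode when it exits, so if those files are missing the wrapper probably never ran inside the container. The helper below only reports which of those files exist. Its name, check_task_dir, is hypothetical, and the file list is taken from the wrapper script in this thread rather than from any official reference.

```shell
#!/bin/bash
# Hypothetical helper: report which of the files written by the Nextflow
# task wrapper are present in a work directory. The file names are the ones
# that appear in the .command.run script shown in this thread.
check_task_dir() {
    local dir=$1 f
    for f in .command.sh .command.run .command.begin .command.log \
             .command.out .command.err .exitcode; do
        if [ -e "$dir/$f" ]; then
            echo "present  $f"
        else
            echo "missing  $f"
        fi
    done
}

# Check the directory given as the first argument, or the current one.
check_task_dir "${1:-.}"
```

If .command.begin and .exitcode are both reported missing, that points at a problem starting the container rather than at the command in .command.sh.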

bentsherman commented 5 years ago

Also, @4ctrl-alt-del, make sure you delete any dangling pods that are left over by a workflow. Nextflow should delete them for you when it exits, but sometimes it doesn't clean them up properly. Here's a command you can use to delete all of the "ContainerCannotRun" pods in batch:

kubectl delete pods $(kubectl get pods --no-headers | grep 'ContainerCannotRun' | awk '{ print $1 }')
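
A slightly more cautious variant is sketched below, assuming the STATUS value is the third column of the listing, as it is in the pod list in this thread. It separates the filtering step so the names can be previewed before anything is deleted, and it matches the status column exactly instead of matching anywhere in the line. The function name pods_with_status and the example-running-pod entry are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print the names of pods whose STATUS column (third field) equals $1.
pods_with_status() {
    awk -v want="$1" '$3 == want { print $1 }'
}

# Demo on sample input; the last line is a made-up pod for illustration.
pods_with_status ContainerCannotRun <<'EOF'
nf-0078adb7054e494d4126beb5997be320   0/1     ContainerCannotRun   0
nf-07a2081bb14b91da55ea20cdfb83f7ce   0/1     ContainerCannotRun   0
example-running-pod                   1/1     Running              0
EOF
```

Against a real cluster this could be used as `kubectl get pods --no-headers | pods_with_status ContainerCannotRun` to preview the names, and then piped into `xargs -r kubectl delete pod` to remove them. The `-r` flag (GNU xargs) skips the delete entirely when no pod matches.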

4ctrl-alt-del commented 4 years ago

Output of kubectl logs (this is the same for ALL crashed pods):

failed to open log file "/var/log/pods/deepgtex-prp_nf-f5c1244025c3be17f2b1a93ef8d61276_7bfa933c-1a09-40a5-a4b4-528ac1a0a142/nf-f5c1244025c3be17f2b1a93ef8d61276/0.log": open /var/log/pods/deepgtex-prp_nf-f5c1244025c3be17f2b1a93ef8d61276_7bfa933c-1a09-40a5-a4b4-528ac1a0a142/nf-f5c1244025c3be17f2b1a93ef8d61276/0.log: no such file or directory

Looking at /workspace/alucinor/work/...:

total 10
drwxr-xr-x 1 root root    3 Oct 28 20:07 .
drwxr-xr-x 1 root root    2 Oct 28 20:07 ..
-rw-r--r-- 1 root root  342 Oct 28 20:07 .command.log
-rw-r--r-- 1 root root 8606 Oct 28 20:07 .command.run
-rw-r--r-- 1 root root  130 Oct 28 20:07 .command.sh

cat .command.log:

failed to open log file "/var/log/pods/deepgtex-prp_nf-ca10407be0e24fb80121ed298ecb6d23_216a3a19-406f-4ab9-a847-4fbba65abeb6/nf-ca10407be0e24fb80121ed298ecb6d23/0.log": open /var/log/pods/deepgtex-prp_nf-ca10407be0e24fb80121ed298ecb6d23_216a3a19-406f-4ab9-a847-4fbba65abeb6/nf-ca10407be0e24fb80121ed298ecb6d23/0.log: no such file or directory

cat .command.run:

#!/bin/bash
# NEXTFLOW TASK: uniprot_sprot_index
set -e
set -u
NXF_DEBUG=${NXF_DEBUG:=0}; [[ $NXF_DEBUG > 1 ]] && set -x
NXF_ENTRY=${1:-nxf_main}

nxf_tree() {
    local pid=$1

    declare -a ALL_CHILDREN
    while read P PP;do
        ALL_CHILDREN[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    pstat() {
        local x_pid=$1
        local STATUS=$(2> /dev/null < /proc/$1/status egrep 'Vm|ctxt')

        if [ $? = 0 ]; then
        local  x_vsz=$(echo "$STATUS" | grep VmSize | awk '{print $2}' || echo -n '0')
        local  x_rss=$(echo "$STATUS" | grep VmRSS | awk '{print $2}' || echo -n '0')
        local x_peak=$(echo "$STATUS" | egrep 'VmPeak|VmHWM' | sed 's/^.*:\s*//' | sed 's/[\sa-zA-Z]*$//' | tr '\n' ' ' || echo -n '0 0')
        local x_pmem=$(awk -v rss=$x_rss -v mem_tot=$mem_tot 'BEGIN {printf "%.0f", rss/mem_tot*100*10}' || echo -n '0')
        local vol_ctxt=$(echo "$STATUS" | grep '\bvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
        local inv_ctxt=$(echo "$STATUS" | grep '\bnonvoluntary_ctxt_switches' | awk '{print $2}' || echo -n '0')
        cpu_stat[x_pid]="$x_pid $x_pmem $x_vsz $x_rss $x_peak $vol_ctxt $inv_ctxt"
        fi
    }

    pwalk() {
        pstat $1
        for i in ${ALL_CHILDREN[$1]:=}; do pwalk $i; done
    }

    pwalk $1
}

nxf_stat() {
    cpu_stat=()
    nxf_tree $1

    declare -a sum=(0 0 0 0 0 0 0 0)
    local pid
    local i
    for pid in "${!cpu_stat[@]}"; do
        local row=(${cpu_stat[pid]})
        [ $NXF_DEBUG = 1 ] && echo "++ stat mem=${row[*]}"
        for i in "${!row[@]}"; do
        if [ $i != 0 ]; then
            sum[i]=$((sum[i]+row[i]))
        fi
        done
    done

    [ $NXF_DEBUG = 1 ] && echo -e "++ stat SUM=${sum[*]}"

    for i in {1..7}; do
        if [ ${sum[i]} -lt ${cpu_peak[i]} ]; then
            sum[i]=${cpu_peak[i]}
        else
            cpu_peak[i]=${sum[i]}
        fi
    done

    [ $NXF_DEBUG = 1 ] && echo -e "++ stat PEAK=${sum[*]}\n"
    nxf_stat_ret=(${sum[*]})
}

nxf_sleep() {
  sleep $1 2>/dev/null || sleep 1;
}

nxf_mem_watch() {
    set -o pipefail
    local pid=$1
    local trace_file=.command.trace
    local count=0;
    declare -a cpu_stat=(0 0 0 0 0 0 0 0)
    declare -a cpu_peak=(0 0 0 0 0 0 0 0)
    local mem_tot=$(< /proc/meminfo grep MemTotal | awk '{print $2}')
    local timeout
    local DONE
    local STOP=''

    [ $NXF_DEBUG = 1 ] && nxf_sleep 0.2 && ps fx

    while true; do
        nxf_stat $pid
        if [ $count -lt 10 ]; then timeout=1;
        elif [ $count -lt 120 ]; then timeout=5;
        else timeout=30;
        fi
        read -t $timeout -r DONE || true
        [[ $DONE ]] && break
        if [ ! -e /proc/$pid ]; then
            [ ! $STOP ] && STOP=$(nxf_date)
            [ $(($(nxf_date)-STOP)) -gt 10000 ] && break
        fi
        count=$((count+1))
    done

    echo "%mem=${nxf_stat_ret[1]}"      >> $trace_file
    echo "vmem=${nxf_stat_ret[2]}"      >> $trace_file
    echo "rss=${nxf_stat_ret[3]}"       >> $trace_file
    echo "peak_vmem=${nxf_stat_ret[4]}" >> $trace_file
    echo "peak_rss=${nxf_stat_ret[5]}"  >> $trace_file
    echo "vol_ctxt=${nxf_stat_ret[6]}"  >> $trace_file
    echo "inv_ctxt=${nxf_stat_ret[7]}"  >> $trace_file
}

nxf_write_trace() {
    echo "nextflow.trace/v2"           > $trace_file
    echo "realtime=$wall_time"         >> $trace_file
    echo "%cpu=$ucpu"                  >> $trace_file
    echo "rchar=${io_stat1[0]}"        >> $trace_file
    echo "wchar=${io_stat1[1]}"        >> $trace_file
    echo "syscr=${io_stat1[2]}"        >> $trace_file
    echo "syscw=${io_stat1[3]}"        >> $trace_file
    echo "read_bytes=${io_stat1[4]}"   >> $trace_file
    echo "write_bytes=${io_stat1[5]}"  >> $trace_file
}

nxf_trace_mac() {
    local start_millis=$(nxf_date)

    /bin/bash -ue /workspace/alucinor/work/ca/10407be0e24fb80121ed298ecb6d23/.command.sh

    local end_millis=$(nxf_date)
    local wall_time=$((end_millis-start_millis))
    local ucpu=''
    local io_stat1=('' '' '' '' '' '')
    nxf_write_trace
}

nxf_trace_linux() {
    local pid=$$
    local num_cpus=$(< /proc/cpuinfo grep '^processor' -c)
    local tot_time0=$(grep '^cpu ' /proc/stat | awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9; printf "%.0f",sum}')
    local cpu_time0=$(2> /dev/null < /proc/$pid/stat awk '{printf "%.0f", ($16+$17)*10 }' || echo -n 'X')
    local io_stat0=($(2> /dev/null < /proc/$pid/io sed 's/^.*:\s*//' | head -n 6 | tr '\n' ' ' || echo -n '0 0 0 0 0 0'))
    local start_millis=$(nxf_date)

    command -v ps &>/dev/null || { >&2 echo "Command 'ps' required by nextflow to collect task metrics cannot be found"; exit 1; }

    /bin/bash -ue /workspace/alucinor/work/ca/10407be0e24fb80121ed298ecb6d23/.command.sh &
    local task=$!

    exec 10> >(nxf_mem_watch $task)
    local mem_proc=$!

    wait $task

    local end_millis=$(nxf_date)
    local tot_time1=$(grep '^cpu ' /proc/stat | awk '{sum=$2+$3+$4+$5+$6+$7+$8+$9; printf "%.0f",sum}')
    local cpu_time1=$(2> /dev/null < /proc/$pid/stat awk '{printf "%.0f", ($16+$17)*10 }' || echo -n 'X')
    local ucpu=$(awk -v p1=$cpu_time1 -v p0=$cpu_time0 -v t1=$tot_time1 -v t0=$tot_time0 -v n=$num_cpus 'BEGIN { pct=(p1-p0)/(t1-t0)*100*n; printf("%.0f", pct>0 ? pct : 0) }' )

    local io_stat1=($(2> /dev/null < /proc/$pid/io sed 's/^.*:\s*//' | head -n 6 | tr '\n' ' ' || echo -n '0 0 0 0 0 0'))
    local i
    for i in {0..5}; do
        io_stat1[i]=$((io_stat1[i]-io_stat0[i]))
    done

    local wall_time=$((end_millis-start_millis))
    [ $NXF_DEBUG = 1 ] && echo "+++ STATS %CPU=$ucpu TIME=$wall_time I/O=${io_stat1[*]}"

    echo "nextflow.trace/v2"           > $trace_file
    echo "realtime=$wall_time"         >> $trace_file
    echo "%cpu=$ucpu"                  >> $trace_file
    echo "rchar=${io_stat1[0]}"        >> $trace_file
    echo "wchar=${io_stat1[1]}"        >> $trace_file
    echo "syscr=${io_stat1[2]}"        >> $trace_file
    echo "syscw=${io_stat1[3]}"        >> $trace_file
    echo "read_bytes=${io_stat1[4]}"   >> $trace_file
    echo "write_bytes=${io_stat1[5]}"  >> $trace_file

    echo 'DONE' >&10
    wait $mem_proc 2>/dev/null || true
    while [ -e /proc/$mem_proc ]; do nxf_sleep 0.1; done
    [ ${NXF_OWNER:=''} ] && chown -fR --from root $NXF_OWNER /workspace/alucinor/work/ca/10407be0e24fb80121ed298ecb6d23/{*,.*} || true
}

nxf_trace() {
    local trace_file=.command.trace
    touch $trace_file
    if [[ $(uname) = Darwin ]]; then
        nxf_trace_mac
    else
        nxf_trace_linux
    fi
}

nxf_date() {
    local ts=$(date +%s%3N); [[ $ts == *3N ]] && date +%s000 || echo $ts
}

nxf_env() {
    echo '============= task environment ============='
    env | sort | sed "s/\(.*\)AWS\(.*\)=\(.\{6\}\).*/\1AWS\2=\3xxxxxxxxxxxxx/"
    echo '============= task output =================='
}

nxf_kill() {
    declare -a children
    while read P PP;do
        children[$PP]+=" $P"
    done < <(ps -e -o pid= -o ppid=)

    kill_all() {
        [[ $1 != $$ ]] && kill $1 2>/dev/null || true
        for i in ${children[$1]:=}; do kill_all $i; done
    }

    kill_all $1
}

nxf_mktemp() {
    local base=${1:-/tmp}
    if [[ $(uname) = Darwin ]]; then mktemp -d $base/nxf.XXXXXXXXXX
    else TMPDIR="$base" mktemp -d -t nxf.XXXXXXXXXX
    fi
}

on_exit() {
    exit_status=${nxf_main_ret:=$?}
    printf $exit_status > /workspace/alucinor/work/ca/10407be0e24fb80121ed298ecb6d23/.exitcode
    set +u
    [[ "$tee1" ]] && kill $tee1 2>/dev/null
    [[ "$tee2" ]] && kill $tee2 2>/dev/null
    [[ "$ctmp" ]] && rm -rf $ctmp || true
    exit $exit_status
}

on_term() {
    set +e
    [[ "$pid" ]] && nxf_kill $pid
}

nxf_launch() {
    /bin/bash /workspace/alucinor/work/ca/10407be0e24fb80121ed298ecb6d23/.command.run nxf_trace
}

nxf_stage() {
    true
}

nxf_unstage() {
    true
    [[ ${nxf_main_ret:=0} != 0 ]] && return
}

nxf_main() {
    trap on_exit EXIT
    trap on_term TERM INT USR1 USR2

    NXF_SCRATCH=''
    [[ $NXF_DEBUG > 0 ]] && nxf_env
    touch /workspace/alucinor/work/ca/10407be0e24fb80121ed298ecb6d23/.command.begin
    set +u
    set -u
    [[ $NXF_SCRATCH ]] && echo "nxf-scratch-dir $HOSTNAME:$NXF_SCRATCH" && cd $NXF_SCRATCH
    nxf_stage

    set +e
    local ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
    local cout=$ctmp/.command.out; mkfifo $cout
    local cerr=$ctmp/.command.err; mkfifo $cerr
    tee .command.out < $cout &
    tee1=$!
    tee .command.err < $cerr >&2 &
    tee2=$!
    ( nxf_launch ) >$cout 2>$cerr &
    pid=$!
    wait $pid || nxf_main_ret=$?
    wait $tee1 $tee2
    nxf_unstage
}

$NXF_ENTRY

cat .command.sh:

diamond makedb       --threads 2       --in /annotater/uniprot_sprot/uniprot_sprot.fasta       --db uniprot_sprot

I am at a complete loss. Am I typing the basic command wrong? This is what I run:

nextflow -C custom_nextflow.conf kuberun SystemsGenetics/AnnoTater -v deepgtex-prp -profile k8s

4ctrl-alt-del commented 4 years ago

The containers now at least run on NRP. They immediately go to an error state because the diamond/interproscan programs fail, but they now produce meaningful failure logs and get past container creation, so I am closing this. A new issue has been opened for the new type of AnnoTater failure.