Closed mmiladi closed 5 years ago
Hey, I am wondering about the minus 2 in --cpu "\${GALAXY_SLOTS:-2}"
From : http://planemo.readthedocs.io/en/latest/writing_advanced.html#cluster-usage
Here we use \${GALAXY_SLOTS:-Z} instead of a fixed value (Z being an integer representing a default value in non-Galaxy contexts). The backslash here is because this value is interpreted at runtime as environment variable - not during command building time as a templated value. Now server administrators can configure how many processes the tool should be allowed to use.
The minus was also confusing for me. :D
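For anyone else tripped up by the minus: `:-` is the POSIX shell "use default value" parameter expansion, not arithmetic. A minimal sketch of how it behaves (the variable names `slots_unset`, `slots_empty` and `slots_set` are made up for illustration and are not part of the wrapper):

```shell
# ${VAR:-default} expands to the value of VAR if it is set and non-empty,
# otherwise to "default". The "-" belongs to the ":-" operator.

unset GALAXY_SLOTS
slots_unset="${GALAXY_SLOTS:-2}"   # variable not set: falls back to 2
echo "unset -> $slots_unset"

GALAXY_SLOTS=""
slots_empty="${GALAXY_SLOTS:-2}"   # set but empty: also falls back to 2
echo "empty -> $slots_empty"

GALAXY_SLOTS=4
slots_set="${GALAXY_SLOTS:-2}"     # set to 4: the default is ignored
echo "set   -> $slots_set"
```

So in the wrapper, 2 should only apply when GALAXY_SLOTS is missing or empty at job runtime; if the job environment sets it to 1, the tool gets --cpu 1.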
Can you try to set $GALAXY_SLOTS=4 during container start?
docker run -i -t -p 8890:80 -e GALAXY_SLOTS=4
also runs single core. Is the command format correct?
Could this be related to this? https://github.com/bgruening/docker-galaxy-stable/blob/master/compose/galaxy-htcondor-executor/startup.sh#L9
@mmiladi, you are not using the composed setup, are you? So this cannot be the reason. I think that Galaxy sets this to 1 at some point.
@bgruening Manually installed infernal inside a bgruening/galaxy-stable instance. Same behavior as docker-galaxygraphclust.
Update: Galaxy startup services complain about coreutils. Could this be the reason?
==> /home/galaxy/logs/slurmctld.log <==
[2017-11-22T15:46:44.902] No trigger state file (/tmp/slurm/trigger_state.old) to recover
[2017-11-22T15:46:44.902] error: Incomplete trigger data checkpoint file
[2017-11-22T15:46:44.902] read_slurm_conf: backup_controller not specified.
[2017-11-22T15:46:44.902] Reinitializing job accounting state
[2017-11-22T15:46:44.902] cons_res: select_p_reconfigure
[2017-11-22T15:46:44.902] cons_res: select_p_node_init
[2017-11-22T15:46:44.902] cons_res: preparing for 1 partitions
[2017-11-22T15:46:44.902] Running as primary controller
[2017-11-22T15:46:44.906] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-11-22T15:46:44.916] error: slurm_receive_msg: Zero Bytes were transmitted or received
tail: unrecognized file system type 0x794c7630 for ‘/home/galaxy/logs/slurmctld.log’. please report this to bug-coreutils@gnu.org. reverting to polling
==> /home/galaxy/logs/slurmd.log <==
[2017-11-22T15:46:44.902] CPU frequency setting not configured for this node
[2017-11-22T15:46:44.904] slurmd version 2.6.5 started
[2017-11-22T15:46:44.904] slurmd started on Wed, 22 Nov 2017 15:46:44 +0000
[2017-11-22T15:46:44.905] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=7982 TmpDisk=99203 Uptime=104128
[2017-11-22T15:46:44.906] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2017-11-22T15:46:44.906] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2017-11-22T15:46:44.906] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2017-11-22T15:46:44.906] error: authentication: Socket communication error
[2017-11-22T15:46:44.906] error: Unable to register: Protocol authentication error
tail: unrecognized file system type 0x794c7630 for ‘/home/galaxy/logs/slurmd.log’. please report this to bug-coreutils@gnu.org. reverting to polling
SLURM environment variables at tool runtime:
declare -x SLURM_CPUS_ON_NODE="1"
declare -x SLURM_GTIDS="0"
declare -x SLURM_JOBID="4"
declare -x SLURM_JOB_CPUS_PER_NODE="1"
declare -x SLURM_JOB_ID="4"
declare -x SLURM_JOB_NODELIST="33e5fe8c8bdf"
declare -x SLURM_JOB_NUM_NODES="1"
declare -x SLURM_LOCALID="0"
declare -x SLURM_NNODES="1"
declare -x SLURM_NODEID="0"
declare -x SLURM_NODELIST="33e5fe8c8bdf"
declare -x SLURM_NODE_ALIASES="(null)"
declare -x SLURM_NPROCS="1"
declare -x SLURM_NTASKS="1"
declare -x SLURM_PROCID="0"
declare -x SLURM_TASKS_PER_NODE="1"
declare -x SLURM_TASK_PID="2004"
declare -x SLURM_TOPOLOGY_ADDR="33e5fe8c8bdf"
declare -x SLURM_TOPOLOGY_ADDR_PATTERN="node"
cmcalibrate is a very time-consuming process and an integral part of the cmbuild Galaxy wrapper. Multi-threading is critical for running the calibration. Currently 2 slots (--cpu "\${GALAXY_SLOTS:-2}") are hard-coded in the wrapper: https://github.com/bgruening/galaxytools/blob/master/tools/rna_tools/infernal/cmbuild.xml#L90
For an unknown reason the Galaxy Docker instance assigns only one process to the task (runtime process info: --cpu 1). The issue might be related to the Slurm scheduler; the log is below: