BackofenLab / GraphClust-2

A pipeline for structural clustering of RNA secondary structures
GNU General Public License v3.0

cmbuild/cmcalibrate multi-threading #57

Closed: mmiladi closed this issue 5 years ago

mmiladi commented 6 years ago

cmcalibrate is a very time-consuming process and an integral part of the cmbuild Galaxy wrapper. Multi-threading is critically needed for running the calibration. Currently a default of 2 slots (--cpu "\${GALAXY_SLOTS:-2}") is hard-coded in the wrapper: https://github.com/bgruening/galaxytools/blob/master/tools/rna_tools/infernal/cmbuild.xml#L90

For an unknown reason, the Galaxy Docker instance assigns only one process to the task (runtime process info: --cpu 1). The issue might be related to the SLURM scheduler; the log is below:

cmbuild -F --fast --symfrac 0.5  --wpb --p7ere 0.38  --EmN 200 --EvN 200 --ElfN 200 --EgfN 200 --enone '/export/galaxy-central/database/files/000/dataset_144.dat' '/export/galaxy-central/database/files/000/dataset_106.dat'   && cmcalibrate -L 1.6 --gtailn 250 --ltailn 750 --seed 181 --beta 1e-15    --cpu "${GALAXY_SLOTS:-2}" '/export/galaxy-central/database/files/000/dataset_144.dat']
galaxy.jobs.runners DEBUG 2017-11-21 14:19:04,306 (50) command is: rm -rf working; mkdir -p working; cd working; /export/galaxy-central/database/job_working_directory/000/50/tool_script.sh; return_code=$?; cd '/export/galaxy-central/database/job_working_directory/000/50'; 
[ "$GALAXY_VIRTUAL_ENV" = "None" ] && GALAXY_VIRTUAL_ENV="$_GALAXY_VIRTUAL_ENV"; _galaxy_setup_environment True
python "/export/galaxy-central/database/job_working_directory/000/50/set_metadata_g9oJJe.py" "/export/galaxy-central/database/job_working_directory/000/50/registry.xml" "/export/galaxy-central/database/job_working_directory/000/50/working/galaxy.json" "/export/galaxy-central/database/job_working_directory/000/50/metadata_in_HistoryDatasetAssociation_144_z9rjR5,/export/galaxy-central/database/job_working_directory/000/50/metadata_kwds_HistoryDatasetAssociation_144_DptCYc,/export/galaxy-central/database/job_working_directory/000/50/metadata_out_HistoryDatasetAssociation_144_9gjdSY,/export/galaxy-central/database/job_working_directory/000/50/metadata_results_HistoryDatasetAssociation_144_Jichyc,/export/galaxy-central/database/files/000/dataset_144.dat,/export/galaxy-central/database/job_working_directory/000/50/metadata_override_HistoryDatasetAssociation_144_asnY13" 5242880; sh -c "exit $return_code"
galaxy.jobs.runners.drmaa DEBUG 2017-11-21 14:19:04,388 (50) submitting file /export/galaxy-central/database/job_working_directory/000/50/galaxy_50.sh
galaxy.jobs.runners.drmaa DEBUG 2017-11-21 14:19:04,388 (50) native specification is: --ntasks=1 --share
galaxy.jobs.runners.drmaa INFO 2017-11-21 14:19:04,392 (50) queued as 51
galaxy.jobs DEBUG 2017-11-21 14:19:04,392 (50) Persisting job destination (destination id: slurm_cluster)

==> /home/galaxy/logs/slurmctld.log <==
[2017-11-21T14:19:04.391] _slurm_rpc_submit_batch_job JobId=51 usec=1189
[2017-11-21T14:19:04.395] sched: Allocate JobId=51 NodeList=c9a44253b50b #CPUs=1

==> /home/galaxy/logs/slurmd.log <==
[2017-11-21T14:19:04.413] Launching batch job 51 for UID 1450

==> /home/galaxy/logs/handler1.log <==
galaxy.jobs.runners.drmaa DEBUG 2017-11-21 14:19:04,848 (50/51) state change: job is running
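
Note the native specification --ntasks=1 --share in the drmaa log: SLURM allocates only a single CPU to the job, regardless of the wrapper's --cpu request. A hypothetical way to check whether the scheduler honors a larger request at all (assuming sbatch is usable inside the container):

# Hypothetical check: submit a trivial job asking for 4 tasks and see
# how many CPUs SLURM actually allocates on the node.
sbatch --ntasks=4 --wrap='echo "CPUs on node: $SLURM_CPUS_ON_NODE"'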

eggzilla commented 6 years ago

Hey, I am wondering about the minus 2 in --cpu "\${GALAXY_SLOTS:-2}".

mmiladi commented 6 years ago

From: http://planemo.readthedocs.io/en/latest/writing_advanced.html#cluster-usage

Here we use \${GALAXY_SLOTS:-Z} instead of a fixed value (Z being an integer representing a default value in non-Galaxy contexts). The backslash here is because this value is interpreted at runtime as environment variable - not during command building time as a templated value. Now server administrators can configure how many processes the tool should be allowed to use.

The minus was also confusing for me. :D
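
For the record, the :- is plain POSIX shell parameter expansion, not a negative number: ${VAR:-default} expands to the value of VAR if it is set and non-empty, and to the literal default otherwise. A minimal demo:

# ${GALAXY_SLOTS:-2} means "use $GALAXY_SLOTS, falling back to 2"
unset GALAXY_SLOTS
echo "${GALAXY_SLOTS:-2}"   # prints 2 (fallback used)
export GALAXY_SLOTS=4
echo "${GALAXY_SLOTS:-2}"   # prints 4 (variable wins)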

bgruening commented 6 years ago

Can you try to set GALAXY_SLOTS=4 during container start?

mmiladi commented 6 years ago

docker run -i -t -p 8890:80 -e GALAXY_SLOTS=4 also runs single-core. Is the command format correct?

Could this be related to this? https://github.com/bgruening/docker-galaxy-stable/blob/master/compose/galaxy-htcondor-executor/startup.sh#L9
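
For what it's worth, the variable does need to reach the container environment in the first place; a quick check (the container id is a placeholder):

# Hypothetical check: is GALAXY_SLOTS visible inside the running container?
docker exec <container-id> bash -c 'echo "GALAXY_SLOTS=$GALAXY_SLOTS"'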

bgruening commented 6 years ago

@mmiladi, you are not using the compose setup, are you? So this cannot be the reason. I think that Galaxy sets this to 1 at some point.
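
If I remember correctly, the generated job script derives GALAXY_SLOTS from scheduler-provided variables at runtime, which would override anything set in the container environment. A paraphrased sketch of that detection logic (not verbatim from the Galaxy source):

# Paraphrased sketch of Galaxy's cluster-slots detection in the job script:
if [ -n "$SLURM_CPUS_ON_NODE" ]; then
    GALAXY_SLOTS="$SLURM_CPUS_ON_NODE"   # SLURM: CPUs in this allocation
elif [ -n "$NSLOTS" ]; then
    GALAXY_SLOTS="$NSLOTS"               # SGE slot count
else
    GALAXY_SLOTS="1"                     # no scheduler info: default to 1
fi
export GALAXY_SLOTS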

mmiladi commented 6 years ago

@bgruening I manually installed Infernal inside a bgruening/galaxy-stable instance. Same behavior as docker-galaxygraphclust.

mmiladi commented 6 years ago

Update: Galaxy startup services complain about coreutils. Could this be the reason?

==> /home/galaxy/logs/slurmctld.log <==
[2017-11-22T15:46:44.902] No trigger state file (/tmp/slurm/trigger_state.old) to recover
[2017-11-22T15:46:44.902] error: Incomplete trigger data checkpoint file
[2017-11-22T15:46:44.902] read_slurm_conf: backup_controller not specified.
[2017-11-22T15:46:44.902] Reinitializing job accounting state
[2017-11-22T15:46:44.902] cons_res: select_p_reconfigure
[2017-11-22T15:46:44.902] cons_res: select_p_node_init
[2017-11-22T15:46:44.902] cons_res: preparing for 1 partitions
[2017-11-22T15:46:44.902] Running as primary controller
[2017-11-22T15:46:44.906] error: slurm_receive_msg: Zero Bytes were transmitted or received
[2017-11-22T15:46:44.916] error: slurm_receive_msg: Zero Bytes were transmitted or received
tail: unrecognized file system type 0x794c7630 for ‘/home/galaxy/logs/slurmctld.log’. please report this to bug-coreutils@gnu.org. reverting to polling

==> /home/galaxy/logs/slurmd.log <==
[2017-11-22T15:46:44.902] CPU frequency setting not configured for this node
[2017-11-22T15:46:44.904] slurmd version 2.6.5 started
[2017-11-22T15:46:44.904] slurmd started on Wed, 22 Nov 2017 15:46:44 +0000
[2017-11-22T15:46:44.905] CPUs=8 Boards=1 Sockets=8 Cores=1 Threads=1 Memory=7982 TmpDisk=99203 Uptime=104128
[2017-11-22T15:46:44.906] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2017-11-22T15:46:44.906] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory (retrying ...)
[2017-11-22T15:46:44.906] error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2017-11-22T15:46:44.906] error: authentication: Socket communication error
[2017-11-22T15:46:44.906] error: Unable to register: Protocol authentication error
tail: unrecognized file system type 0x794c7630 for ‘/home/galaxy/logs/slurmd.log’. please report this to bug-coreutils@gnu.org. reverting to polling

mmiladi commented 6 years ago

SLURM environment variables during the tool runtime:

declare -x SLURM_CPUS_ON_NODE="1"
declare -x SLURM_GTIDS="0"
declare -x SLURM_JOBID="4"
declare -x SLURM_JOB_CPUS_PER_NODE="1"
declare -x SLURM_JOB_ID="4"
declare -x SLURM_JOB_NODELIST="33e5fe8c8bdf"
declare -x SLURM_JOB_NUM_NODES="1"
declare -x SLURM_LOCALID="0"
declare -x SLURM_NNODES="1"
declare -x SLURM_NODEID="0"
declare -x SLURM_NODELIST="33e5fe8c8bdf"
declare -x SLURM_NODE_ALIASES="(null)"
declare -x SLURM_NPROCS="1"
declare -x SLURM_NTASKS="1"
declare -x SLURM_PROCID="0"
declare -x SLURM_TASKS_PER_NODE="1"
declare -x SLURM_TASK_PID="2004"
declare -x SLURM_TOPOLOGY_ADDR="33e5fe8c8bdf"
declare -x SLURM_TOPOLOGY_ADDR_PATTERN="node"
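
These values confirm it: the SLURM allocation itself is a single CPU (SLURM_CPUS_ON_NODE=1, matching the --ntasks=1 native specification logged above), so GALAXY_SLOTS can never exceed 1 no matter what the wrapper asks for. The fix should therefore be on the job destination side; a sketch, assuming the standard slurm_cluster destination in config/job_conf.xml (values are illustrative):

# Assumption: the destination's nativeSpecification in config/job_conf.xml
# is changed from "--ntasks=1 --share" to e.g. "--ntasks=4 --share"
# (or --cpus-per-task=4 for a single multi-threaded process).
# After restarting Galaxy, a job's runtime environment should then show:
echo "SLURM_CPUS_ON_NODE=$SLURM_CPUS_ON_NODE"   # expected: 4
echo "GALAXY_SLOTS=$GALAXY_SLOTS"               # expected: 4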