galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Enhance Job Metrics Framework and Integration #5134

Open jmchilton opened 6 years ago

jmchilton commented 6 years ago

Two job metrics Trello cards existed and had good points that never seemed to make it to GitHub.

https://trello.com/c/uAcQYz5I/1606-job-metrics-reports-and-interpretation covered integration of job metrics with the reports app.

https://trello.com/c/XsQdqliU/1607-job-metrics-technical-enhancements

Newer Ideas:

hexylena commented 6 years ago

Maybe to add to this: custom site-specific metric plugins loaded from a directory?

We have some extra pieces of information that we could collect but that wouldn't be generally useful. As a specific example, we store a 'build tag' in /etc/vgcn-release (indicating the version of the VM that the job is running on) that we could collect. It might be useful to have this attached to the job for debugging purposes.
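A rough sketch of what such a site-specific collector might look like, written as a standalone class rather than against Galaxy's real plugin interface. The class name, method names, and the `__instrument_release_tag` file name are all hypothetical; only the /etc/vgcn-release path comes from the example above. The idea is that a shell command runs on the execution node and copies the build tag into the job directory, and the tag is read back from there afterwards.

```python
import shlex
from pathlib import Path


class ReleaseTagCollector:
    """Hypothetical site-specific metric collector (not Galaxy's plugin API)."""

    # Hypothetical name of the file written into the job directory.
    output_name = "__instrument_release_tag"

    def __init__(self, release_file="/etc/vgcn-release"):
        # Path from the example above; override for other sites or for tests.
        self.release_file = release_file

    def pre_execute_command(self, job_directory):
        """Shell command to run on the execution node before the tool starts."""
        target = Path(job_directory) / self.output_name
        source = shlex.quote(self.release_file)
        return f"cat {source} > {shlex.quote(str(target))} 2>/dev/null"

    def job_properties(self, job_directory):
        """Read the copied tag back; return {} if it is missing or empty."""
        target = Path(job_directory) / self.output_name
        try:
            tag = target.read_text().strip()
        except OSError:
            return {}
        return {"build_tag": tag} if tag else {}
```

Copying the file from within the job script, instead of reading it directly, matters because the file lives on the compute node's VM image and not necessarily on the machine that records the metrics.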

nsoranzo commented 6 years ago

@erasche For your specific example, you can probably use the <env /> plugin if you export that information as an environment variable.
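To illustrate the idea (this is only a sketch, not the plugin's actual code), collecting a configured list of environment variables could look roughly like the following. The function name `collect_env` and the comma-separated format of its argument are assumptions made for the example.

```python
import os


def collect_env(variables, environ=None):
    """Return {name: value} for each requested variable that is set.

    `variables` is a comma-separated string of names (an assumed format);
    `environ` defaults to the current process environment.
    """
    environ = os.environ if environ is None else environ
    names = [name.strip() for name in variables.split(",") if name.strip()]
    # Variables that are not set are skipped rather than recorded as empty.
    return {name: environ[name] for name in names if name in environ}
```

With this approach a variable only shows up if it is actually present in the environment the job runs in, so it has to be exported somewhere the job will inherit it from.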

hexylena commented 6 years ago

@nsoranzo huh, interesting idea. I'll test that! Thanks

hexylena commented 6 years ago

Ah, bit by an old 'bug'.

Looks like that's not an option in my case: HTCondor cleans the environment before running a job. It's something we've noticed earlier, and it causes problems for us in other cases as well. For example, HTCondor overrides the TMPDIR settings, which led us to manually patch the upload tool. Please excuse the exasperated commit message.

Example of the environment a Condor job gets by default (though Galaxy does set the option to pass through its environment):

BATCH_SYSTEM=HTCondor
OMP_NUM_THREADS=1
PWD=/data/dnb01/condor-galaxy
SHLVL=1
TEMP=/var/lib/condor/execute/dir_3563
TMP=/var/lib/condor/execute/dir_3563
TMPDIR=/var/lib/condor/execute/dir_3563
_=/usr/bin/env
_CHIRP_DELAYED_UPDATE_PREFIX=Chirp
_CONDOR_ANCESTOR_1303=1890:1539771006:3869518212
_CONDOR_ANCESTOR_1890=3563:1539784048:1775810939
_CONDOR_ANCESTOR_3563=3564:1539784048:1021639213
_CONDOR_CHIRP_CONFIG=/var/lib/condor/execute/dir_3563/.chirp.config
_CONDOR_JOB_AD=/var/lib/condor/execute/dir_3563/.job.ad
_CONDOR_JOB_IWD=/data/dnb01/condor-galaxy
_CONDOR_JOB_PIDS=
_CONDOR_MACHINE_AD=/var/lib/condor/execute/dir_3563/.machine.ad
_CONDOR_SCRATCH_DIR=/var/lib/condor/execute/dir_3563
_CONDOR_SLOT=slot1_1

vs the environment of the machine if you just log in normally:

HISTCONTROL=ignoredups
HISTSIZE=1000
HOME=/home/centos
HOSTNAME=vgcnbwc-training-beta-0.novalocal
LANG=en_US.UTF-8
LESSOPEN=||/usr/bin/lesspipe.sh %s
LOGNAME=centos
LS_COLORS=
MAIL=/var/spool/mail/centos
PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/centos/.local/bin:/home/centos/bin
PWD=/home/centos
SELINUX_LEVEL_REQUESTED=
SELINUX_ROLE_REQUESTED=
SELINUX_USE_CURRENT_RANGE=
SHELL=/bin/bash
SHLVL=1
SSH_CLIENT=132.230.68.5 8078 22
SSH_CONNECTION=132.230.68.5 8078 10.5.68.18 22
SSH_TTY=/dev/pts/1
USER=centos
_=/usr/bin/env
VGCN_RELEASE=CentOS 7.5.1804 VGGP vggp-v31-j95-9c1a332fb4d7-master

The cleaned environment also broke the hostname plugin for us, so we had to make some strange changes to it.
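For anyone wanting to check their own cluster, the two environment listings above can be compared with a few lines of Python. This is a minimal sketch: the function names are made up for the example, and it assumes one KEY=VALUE pair per line (values containing newlines would not parse correctly).

```python
def parse_env(dump):
    """Parse `env` output (one KEY=VALUE per line) into a dict."""
    env = {}
    for line in dump.splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            env[key] = value
    return env


def missing_in_job(login_dump, job_dump):
    """Names set in the login environment but absent from the job's."""
    login, job = parse_env(login_dump), parse_env(job_dump)
    return sorted(set(login) - set(job))


# A few lines taken from the two listings above.
login_dump = """HOME=/home/centos
SHLVL=1
_=/usr/bin/env
VGCN_RELEASE=CentOS 7.5.1804 VGGP vggp-v31-j95-9c1a332fb4d7-master"""
job_dump = """BATCH_SYSTEM=HTCondor
SHLVL=1
TMPDIR=/var/lib/condor/execute/dir_3563
_=/usr/bin/env"""

print(missing_in_job(login_dump, job_dump))  # ['HOME', 'VGCN_RELEASE']
```

Running it against full dumps from a login shell and from a test job gives a quick list of the variables that never reach the job.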