Closed jmchilton closed 6 years ago
use_legacy_home on the command
. Only new tools on new Galaxy servers can use this argument. If a tool has a real problem with this new behavior the tool can designed in a better way I assume. This flag add nothing useful for backward compatibility, isn't it?Thanks John for the writeup!
If a tool has a real problem with this new behavior the tool can designed in a better way I assume.
I think there are going to be local tools that use $HOME
in legitimate ways for local uses. There are definitely applications for which $HOME
is the correct way to share this concept I think. I think this should an option, we should just discourage its use in "best practice" tools.
TEMPDIR also exists in some distributions
Any reference for that?
@jmchilton: Should the temp base default to pwd
when tmp_dir
is unset and profile >= 18.01
? I feel like it should be outside the working directory, but inside the job directory (so basically, ..
). There may be things that attempt to copy around the entire contents of the working directory and we don't want them copying the contents of any temp directories.
@bgruening: I agree with respect to use_legacy_home
. I think admins probably want/need a way to disable both features in the job conf, however.
Related to that second point, if it's disable-able, tools cannot assume that any of $TMP
, $TMPDIR
or $TEMP
are set. If they're needed inside the wrapper command line they should probably be referenced like ${TMPDIR:-$(pwd)}
.
@jmchilton: if the manipulation of $HOME
occurs in the job script after the inclusion of <env>
variables, then you could potentially do something like <env id="ORIG_HOME">$HOME</env>
and then use HOME=${ORIG_HOME:-$HOME}
in your tool wrapper. Maybe we should just set an env var for that by default, in fact... like $_GALAXY_ORIG_HOME
?
@nsoranzo https://en.wikipedia.org/wiki/TMPDIR
Related to that second point, if it's disable-able, tools cannot assume that any of $TMP, $TMPDIR or $TEMP are set. If they're needed inside the wrapper command line they should probably be referenced like ${TMPDIR:-$(pwd)}.
Profile 18.01 will be guaranteed all three are set. Older tools won't, but they aren't now right?
@jmchilton: Should the temp base default to pwd when tmp_dir is unset and profile >= 18.01? I feel like it should be outside the working directory, but inside the job directory (so basically, ..). There may be things that attempt to copy around the entire contents of the working directory and we don't want them copying the contents of any temp directories.
Yes... of course. I'll update this.
@bgruening I mean a real example ;)
https://docs.python.org/2/library/tempfile.html#tempfile.tempdir doesn't list it and most Google hits for "tempdir" environment variable
are typos.
@jmchilton: I just worry that some admins are not going to be okay with those 3 vars being forcibly set/overridden. It's also sorta counter to the UNIX philosopy (and the actual POSIX spec) that $TMPDIR
be required to be set, programs are supposed to use a sensible default (/tmp
) if it's not. To get "expected POSIX behavior" you'll have to set tmp_dir=/tmp
and even then the program may behave differently with $TMPDIR
set than it would have without it.
Also, /tmp
is sometimes in memory, which gives you fairly limited space but much greater performance. Tools may need a way to pick between "big and slow" and "fast and small".
@nsoranzo I'm sure I have seen this in real life. Not sure where. But just have a look at: https://github.com/search?l=Shell&q=export+TEMPDIR&type=Code&utf8=%E2%9C%93 ... people use all obscure stuff. That is one reason we should handle this on a job level by an admin, so that a tool dev should not care about this much.
@natefoo
So you'd want a set_tmp_dir
boolean in job_conf to disable all of that? I will add that but I'm not happy about it and I don't think people should use it.
Also, /tmp is sometimes in memory, which gives you fairly limited space but much greater performance. Tools may need a way to pick between "big and slow" and "fast and small".
This provides the path for controlling these things much more clearly IMO. One place in the job_conf to control all this for any tool for any of the common environment variables.
It's also sorta counter to the UNIX philosopy (and the actual POSIX spec) that $TMPDIR be required to be set, programs are supposed to use a sensible default (/tmp) if it's not.
I'm saying we will set this now - for all tools. That first part is done, the other half of that sentence is not applicable but if it was the sensible default of /tmp
is not sensible for reproducibility.
@bgruening By looking at some of those scripts, TEMPDIR
is just the name of a local variable used for a temporary directory, not for getting the system temporary directory set by the sysadmin.
So you'd want a set_tmp_dir boolean in job_conf to disable all of that? I will add that but I'm not happy about it and I don't think people should use it.
Well, it wouldn't be possible to use it if you want to use profile>=18.01
tools and those tools expect those vars to be set.
This provides the path for controlling these things much more clearly IMO. One place in the job_conf to control all this for any tool for any of the common environment variables.
We'd need standard vars for this, though, that we guarantee to exist, (sound familiar? ha) for them to be used globally by tool authors. In this case, however, it would be preferable to me to use something like $_GALAXY_BIG_SCRATCH
and $_GALAXY_SMALL_SCRATCH
or something. I don't like those names, but that's the idea. A wrapper can then set $TMPDIR
and the like based on what's suitable for the wrapped binary.
I don't know. Maybe I am fretting too much. I just think like an old UNIX admin and this heavy handed approach would rub me the wrong way, but ultimately it will probably work fine. I think it just needs to be documented in big blinking text that Galaxy will mess with these vars.
Well, it wouldn't be possible to use it if you want to use profile>=18.01 tools and those tools expect those vars to be set.
You could manually set those variables in number of ways still right? So are you saying I don't need to add this?
Yes, but they'd have to be set, and that's not very POSIX-y.
@natefoo So you would prefer the current status quo? I can ignore this problem for Galaxy tools if you'd like. I think it will be much easier to maintain and configure Galaxy servers going forward if we address this with a heavy hand but I won't if you don't want me to.
No, the status quo has problems. For Main, we set $TEMP
and $TMPDIR
to $(dirname ${BASH_SOURCE[0]})/_galaxy_tmp
using <env>
on our destinations because of this. But there should be a "most likely to work" configuration out of the box so people don't have to "discover" that they might/probably want to set them, and your proposal accomplishes this.
Also, I guess I can convince myself that it's not really egregious since the Galaxy framework itself will honor whatever you configure for the temp variables in the environment in which you start it. The fact that it overrides them in a configurable manner for a controlled execution environment is ok.
Next question - CWL defines HOME
to be the pwd
, but if we are going to define a tmp
directory for jobs I guess for Galaxy tools we'd expect not to collect files from that directory so that tmp directory might be a better default than pwd
for Galaxy. The framework will still be updated to allow CWL tools to point this at HOME but it'd be better for Galaxy to point at this temp directory I'm thinking.
I'd suggest a second empty directory like pwd/../home
instead of reusing the temp directory.
What about adding something like this to the template for tool_script.sh :
export TMPDIR=$(mktemp -d "${TMPDIR:-$job_dir}/tmp.XXXXXXXXX")
export TEMP=$TMPDIR
export TMP=$TMPDIR
where $job_dir
obviously is substituted with the actual job directory when creating the script.
In this way we respect the admin-configured TMPDIR but guarantee a new directory for every job.
@nsoranzo I've got issues with this proposal but I need to know if you are suggestion this for all tools or certain profiles of tools (e.g. < 18.01 or >= 18.01)?
@jmchilton All tools. This won't break assumptions of old tools or admin-configured environment variables. It also satisfies the "goal of providing tools with isolated temp directories". What are the issues?
@natefoo can obtain its present configuration simply by setting TMPDIR
to the empty value in job_conf.xml
.
@nsoranzo Thanks for the clarification. One more question before the pushback - are you arguing against making it settable via a property that Galaxy can reason about ahead of time? Like <param id="tmp_dir">/path/to/tmp</param>
the way that is discussed above or are you saying that if that is not set we should use your strategy as a fallback?
I think that setting it with:
<env id="TMPDIR">/my/temp/dir</env>
should be enough and there is no need to introduce Galaxy-specific variables.
@nsoranzo I have some problems with this for sure then.
If I see this as the admin:
<env id="TMPDIR">/my/temp/dir</env>
I would definitely expect TMPDIR
to be set to that directory and not a subdirectory of that directory in the job. All the other env
variables would be passed through as is, Galaxy doesn't dispatch on them and do different things with them and Galaxy certainly doesn't actively rewrite them (to be a subdirectory in the instance). If we are going to process the parameter and use it - it should be a param
not an env
. For instance in the local runner I encourage <param id="local_slots">
instead of just using <env id="GALAXY_SLOTS">
. I think we want the ability to set this temp directory and reason about it from within Galaxy itself, for different kinds of tools, for Pulsar, for Docker, etc.... I think it would be more clear we were doing that with a param
.
There are a lot of cool things we could imagine doing with object stores, Pulsar, and Docker with regards to this variable. We could have a parameter that would have Galaxy use the object store closest to the input data to define a temp. We could have a parameter for allocating in-memory tmp on large memory systems. We could tweak cloud mounts or Docker mounts with dynamic rules based on estimations of tmp space used. In addition to admin defined Galaxy dynamic job destinations - one could imagine tools defining constraints for tmp as well. The CWL for instance provides a syntax for specifying the amount of tmp needed (http://www.commonwl.org/v1.0/CommandLineTool.html#ResourceRequirement). I don't want to just dispatch on it at runtime on the cluster - I want it in Galaxy - I want it before the job has been submitted. If it is in Galaxy, I want it as a param
not an env
with different semantics than all the other env
s.
I think it in addition to it being surprising that TMPDIR is being rewritten, I think it would be surprising that TMP and TEMP are being rewritten based on that. The one advantage of your approach of my preferred approach is that existing admin configurations are less likely to be broken. But if I'm one of those admins and I have set all three - why should just one be preserved and not the others? I get that is the right one to preserve if we were going to go this route - it is just a bit surprising that a sort of raw value like an env
value would work that way.
So I strongly feel with should have a dedicated job_conf / galaxy.ini syntax for setting up the base of temp directory as I outlined in this comment above. I feel less strongly that we shouldn't use TMPDIR
as a default but instead use the jobs_directory/<job_hash>/tmp.XXXX
(i.e. working_directory/../tmp.XXX
). It is a less strong belief but I still think it is the correct out-of-box default.
If I knew the Galaxy admin had set up $TMPDIR
I would definitely be okay with it being the default. But if it is just some system setting, I think 54.6% of the time it would be better to use something next to Galaxy's working directory by default - it is more likely to have enough space and more likely to work out of the box. I've been burned by Galaxy filling up tiny /tmp
directories and such before at MSI and on cloud nodes. jobs_directory
is going to be more likely to have more space allocated I think. If there are disk issues - the admin won't need to figure out if they are in /tmp or in the working directory - there is one place by default to fix these things. If other things are running on the cluster and filling up tmp Galaxy is less likely to break as well. It also has the advantage of making things easier to debug - if I turn cleanup jobs off the job's temp files won't be cleaned up. This should in theory be great for replicating Galaxy's command-lines and re-running failed jobs to see what happened.
Also - one of those potential defaults is knowable ahead of time and the other is not. So as we do more sophisticated things with job staging - it will be easier to do these awesome things with less configuration if we go with the knowable default instead of the cluster defined default.
A job conf setting for this that defaults to something the working directory is the thing that is going to be the easiest to debug, least likely to break after some simple modifications to Galaxy's configuration, and the easiest to understand for future admins.
I can add a setting for disabling this or make it only apply to 18.01+ configurations (if we do switch the configuration file to YAML this release) - so this doesn't have to get in the way of admins wanting to manually tweak things with env
directives. It isn't a level of abstraction I want Galaxy admins to have to work at, and it certainly isn't a level of abstraction I want to have to reason about within Galaxy - but it is a level of abstraction I'd be happy to give you a flag to enable if you want to keep working that way.
I don't have time now for a detailed response, but I just want to add that if the TMPDIR is set by the cluster admin on the node, this will be respected with my approach, while that is not possible with a variable set by a param
in job_conf.xml
.
I've made the case that I don't believe deferring to that should be the out of the box default but Ill provide a flag for instances that would like to do this.
Question: there are some command-line tools (e.g ete3, tectonic) which cache downloaded data in hidden directories (~/.etetoolkit/
, ~/.cache/Tectonic/
). This is useful because:
I have two ideas which can help:
use_legacy_home
to use_cached_home
or something like thatgalaxy.ini
to configure the HOME used when use_cached_home="true"
is specified. That's not strictly necessary since the admin can already specify a HOME environment variable with the environment_setup_file
directive or per-destination in job_conf.xml
.rename use_legacy_home to use_cached_home or something like that
Good point - this is entirely valid use case. I'll call it use_shared_home
- I'll try to make it settable in galaxy's config or job_conf on a per destination basis - it should cover all the bases.
We had a long conversation in Gitter about this - we have a goal of providing tools with isolated HOME and temp directories with the minimal complexity of admin configuration while being as backward compatible as possible.
The outcome (I think - I will revise this as we discuss it further):
tmp_dir
andlegacy_home_dir
.tmp_dir
is set, all tools will haveTMP
,TMPDIR
, andTEMP
set to a new directory within that directory. This new directory will be empty and created in the job script so that it can be remote scratch. There will be a mechanism by which an existing job cluster supplied environment variable can be used for the basetmp_dir
value.tmp_dir
is not set, Galaxy will not set these three environment variables forprofile < 18.01
tools.tmp_dir
is not set, Galaxy will set these three environment variables to<job_directory>/tmp
(e.g.pwd/../tmp
by default) forprofile >= 18.01
tools.legacy_home_dir
is set and a tool specifiesuse_legacy_home
on thecommand
element or isprofile < 18.01
,HOME
will be set to that variable.profile >= 18.01
or older tools withuse_legacy_home="false"
set oncommand
element,HOME
will be set topwd
.HOME
andTMP
handling from environment variables and environment files injob_conf
and switch to the new job conf variables to ensure the continued correct usage of tools.