galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org
Other
1.4k stars 1.01k forks source link

Reworking HOME and TMPDIR handling in 18.01. #4606

Closed jmchilton closed 6 years ago

jmchilton commented 7 years ago

We had a long conversation in Gitter about this - we have a goal of providing tools with isolated HOME and temp directories with the minimal complexity of admin configuration while being as backward compatible as possible.

The outcome (I think - I will revise this as we discuss it further):

bgruening commented 7 years ago

Thanks John for the writeup!

jmchilton commented 7 years ago

If a tool has a real problem with this new behavior the tool can designed in a better way I assume.

I think there are going to be local tools that use $HOME in legitimate ways for local uses. There are definitely applications for which $HOME is the correct way to share this concept I think. I think this should an option, we should just discourage its use in "best practice" tools.

nsoranzo commented 7 years ago

TEMPDIR also exists in some distributions

Any reference for that?

natefoo commented 7 years ago

@jmchilton: Should the temp base default to pwd when tmp_dir is unset and profile >= 18.01? I feel like it should be outside the working directory, but inside the job directory (so basically, ..). There may be things that attempt to copy around the entire contents of the working directory and we don't want them copying the contents of any temp directories.

@bgruening: I agree with respect to use_legacy_home. I think admins probably want/need a way to disable both features in the job conf, however.

Related to that second point, if it's disable-able, tools cannot assume that any of $TMP, $TMPDIR or $TEMP are set. If they're needed inside the wrapper command line they should probably be referenced like ${TMPDIR:-$(pwd)}.

natefoo commented 7 years ago

@jmchilton: if the manipulation of $HOME occurs in the job script after the inclusion of <env> variables, then you could potentially do something like <env id="ORIG_HOME">$HOME</env> and then use HOME=${ORIG_HOME:-$HOME} in your tool wrapper. Maybe we should just set an env var for that by default, in fact... like $_GALAXY_ORIG_HOME?

bgruening commented 7 years ago

@nsoranzo https://en.wikipedia.org/wiki/TMPDIR

jmchilton commented 7 years ago

Related to that second point, if it's disable-able, tools cannot assume that any of $TMP, $TMPDIR or $TEMP are set. If they're needed inside the wrapper command line they should probably be referenced like ${TMPDIR:-$(pwd)}.

Profile 18.01 will be guaranteed all three are set. Older tools won't, but they aren't now right?

jmchilton commented 7 years ago

@jmchilton: Should the temp base default to pwd when tmp_dir is unset and profile >= 18.01? I feel like it should be outside the working directory, but inside the job directory (so basically, ..). There may be things that attempt to copy around the entire contents of the working directory and we don't want them copying the contents of any temp directories.

Yes... of course. I'll update this.

nsoranzo commented 7 years ago

@bgruening I mean a real example ;) https://docs.python.org/2/library/tempfile.html#tempfile.tempdir doesn't list it and most Google hits for "tempdir" environment variable are typos.

natefoo commented 7 years ago

@jmchilton: I just worry that some admins are not going to be okay with those 3 vars being forcibly set/overridden. It's also sorta counter to the UNIX philosopy (and the actual POSIX spec) that $TMPDIR be required to be set, programs are supposed to use a sensible default (/tmp) if it's not. To get "expected POSIX behavior" you'll have to set tmp_dir=/tmp and even then the program may behave differently with $TMPDIR set than it would have without it.

Also, /tmp is sometimes in memory, which gives you fairly limited space but much greater performance. Tools may need a way to pick between "big and slow" and "fast and small".

bgruening commented 7 years ago

@nsoranzo I'm sure I have seen this in real life. Not sure where. But just have a look at: https://github.com/search?l=Shell&q=export+TEMPDIR&type=Code&utf8=%E2%9C%93 ... people use all obscure stuff. That is one reason we should handle this on a job level by an admin, so that a tool dev should not care about this much.

jmchilton commented 7 years ago

@natefoo

So you'd want a set_tmp_dir boolean in job_conf to disable all of that? I will add that but I'm not happy about it and I don't think people should use it.

Also, /tmp is sometimes in memory, which gives you fairly limited space but much greater performance. Tools may need a way to pick between "big and slow" and "fast and small".

This provides the path for controlling these things much more clearly IMO. One place in the job_conf to control all this for any tool for any of the common environment variables.

It's also sorta counter to the UNIX philosopy (and the actual POSIX spec) that $TMPDIR be required to be set, programs are supposed to use a sensible default (/tmp) if it's not.

I'm saying we will set this now - for all tools. That first part is done, the other half of that sentence is not applicable but if it was the sensible default of /tmp is not sensible for reproducibility.

nsoranzo commented 7 years ago

@bgruening By looking at some of those scripts, TEMPDIR is just the name of a local variable used for a temporary directory, not for getting the system temporary directory set by the sysadmin.

natefoo commented 7 years ago

So you'd want a set_tmp_dir boolean in job_conf to disable all of that? I will add that but I'm not happy about it and I don't think people should use it.

Well, it wouldn't be possible to use it if you want to use profile>=18.01 tools and those tools expect those vars to be set.

This provides the path for controlling these things much more clearly IMO. One place in the job_conf to control all this for any tool for any of the common environment variables.

We'd need standard vars for this, though, that we guarantee to exist, (sound familiar? ha) for them to be used globally by tool authors. In this case, however, it would be preferable to me to use something like $_GALAXY_BIG_SCRATCH and $_GALAXY_SMALL_SCRATCH or something. I don't like those names, but that's the idea. A wrapper can then set $TMPDIR and the like based on what's suitable for the wrapped binary.

I don't know. Maybe I am fretting too much. I just think like an old UNIX admin and this heavy handed approach would rub me the wrong way, but ultimately it will probably work fine. I think it just needs to be documented in big blinking text that Galaxy will mess with these vars.

jmchilton commented 7 years ago

Well, it wouldn't be possible to use it if you want to use profile>=18.01 tools and those tools expect those vars to be set.

You could manually set those variables in number of ways still right? So are you saying I don't need to add this?

natefoo commented 7 years ago

Yes, but they'd have to be set, and that's not very POSIX-y.

jmchilton commented 7 years ago

@natefoo So you would prefer the current status quo? I can ignore this problem for Galaxy tools if you'd like. I think it will be much easier to maintain and configure Galaxy servers going forward if we address this with a heavy hand but I won't if you don't want me to.

natefoo commented 7 years ago

No, the status quo has problems. For Main, we set $TEMP and $TMPDIR to $(dirname ${BASH_SOURCE[0]})/_galaxy_tmp using <env> on our destinations because of this. But there should be a "most likely to work" configuration out of the box so people don't have to "discover" that they might/probably want to set them, and your proposal accomplishes this.

Also, I guess I can convince myself that it's not really egregious since the Galaxy framework itself will honor whatever you configure for the temp variables in the environment in which you start it. The fact that it overrides them in a configurable manner for a controlled execution environment is ok.

jmchilton commented 7 years ago

Next question - CWL defines HOME to be the pwd, but if we are going to define a tmp directory for jobs I guess for Galaxy tools we'd expect not to collect files from that directory so that tmp directory might be a better default than pwd for Galaxy. The framework will still be updated to allow CWL tools to point this at HOME but it'd be better for Galaxy to point at this temp directory I'm thinking.

natefoo commented 7 years ago

I'd suggest a second empty directory like pwd/../home instead of reusing the temp directory.

nsoranzo commented 7 years ago

What about adding something like this to the template for tool_script.sh :

export TMPDIR=$(mktemp -d "${TMPDIR:-$job_dir}/tmp.XXXXXXXXX")
export TEMP=$TMPDIR
export TMP=$TMPDIR

where $job_dir obviously is substituted with the actual job directory when creating the script. In this way we respect the admin-configured TMPDIR but guarantee a new directory for every job.

jmchilton commented 7 years ago

@nsoranzo I've got issues with this proposal but I need to know if you are suggestion this for all tools or certain profiles of tools (e.g. < 18.01 or >= 18.01)?

nsoranzo commented 7 years ago

@jmchilton All tools. This won't break assumptions of old tools or admin-configured environment variables. It also satisfies the "goal of providing tools with isolated temp directories". What are the issues?

@natefoo can obtain its present configuration simply by setting TMPDIR to the empty value in job_conf.xml .

jmchilton commented 7 years ago

@nsoranzo Thanks for the clarification. One more question before the pushback - are you arguing against making it settable via a property that Galaxy can reason about ahead of time? Like <param id="tmp_dir">/path/to/tmp</param> the way that is discussed above or are you saying that if that is not set we should use your strategy as a fallback?

nsoranzo commented 7 years ago

I think that setting it with:

<env id="TMPDIR">/my/temp/dir</env>

should be enough and there is no need to introduce Galaxy-specific variables.

jmchilton commented 7 years ago

@nsoranzo I have some problems with this for sure then.

Problem 1

If I see this as the admin:

<env id="TMPDIR">/my/temp/dir</env>

I would definitely expect TMPDIR to be set to that directory and not a subdirectory of that directory in the job. All the other env variables would be passed through as is, Galaxy doesn't dispatch on them and do different things with them and Galaxy certainly doesn't actively rewrite them (to be a subdirectory in the instance). If we are going to process the parameter and use it - it should be a param not an env. For instance in the local runner I encourage <param id="local_slots"> instead of just using <env id="GALAXY_SLOTS">. I think we want the ability to set this temp directory and reason about it from within Galaxy itself, for different kinds of tools, for Pulsar, for Docker, etc.... I think it would be more clear we were doing that with a param.

There are a lot of cool things we could imagine doing with object stores, Pulsar, and Docker with regards to this variable. We could have a parameter that would have Galaxy use the object store closest to the input data to define a temp. We could have a parameter for allocating in-memory tmp on large memory systems. We could tweak cloud mounts or Docker mounts with dynamic rules based on estimations of tmp space used. In addition to admin defined Galaxy dynamic job destinations - one could imagine tools defining constraints for tmp as well. The CWL for instance provides a syntax for specifying the amount of tmp needed (http://www.commonwl.org/v1.0/CommandLineTool.html#ResourceRequirement). I don't want to just dispatch on it at runtime on the cluster - I want it in Galaxy - I want it before the job has been submitted. If it is in Galaxy, I want it as a param not an env with different semantics than all the other envs.

Problem 2

I think it in addition to it being surprising that TMPDIR is being rewritten, I think it would be surprising that TMP and TEMP are being rewritten based on that. The one advantage of your approach of my preferred approach is that existing admin configurations are less likely to be broken. But if I'm one of those admins and I have set all three - why should just one be preserved and not the others? I get that is the right one to preserve if we were going to go this route - it is just a bit surprising that a sort of raw value like an env value would work that way.

Problem 3

So I strongly feel with should have a dedicated job_conf / galaxy.ini syntax for setting up the base of temp directory as I outlined in this comment above. I feel less strongly that we shouldn't use TMPDIR as a default but instead use the jobs_directory/<job_hash>/tmp.XXXX (i.e. working_directory/../tmp.XXX). It is a less strong belief but I still think it is the correct out-of-box default.

If I knew the Galaxy admin had set up $TMPDIR I would definitely be okay with it being the default. But if it is just some system setting, I think 54.6% of the time it would be better to use something next to Galaxy's working directory by default - it is more likely to have enough space and more likely to work out of the box. I've been burned by Galaxy filling up tiny /tmp directories and such before at MSI and on cloud nodes. jobs_directory is going to be more likely to have more space allocated I think. If there are disk issues - the admin won't need to figure out if they are in /tmp or in the working directory - there is one place by default to fix these things. If other things are running on the cluster and filling up tmp Galaxy is less likely to break as well. It also has the advantage of making things easier to debug - if I turn cleanup jobs off the job's temp files won't be cleaned up. This should in theory be great for replicating Galaxy's command-lines and re-running failed jobs to see what happened.

Also - one of those potential defaults is knowable ahead of time and the other is not. So as we do more sophisticated things with job staging - it will be easier to do these awesome things with less configuration if we go with the knowable default instead of the cluster defined default.

Conclusion

A job conf setting for this that defaults to something the working directory is the thing that is going to be the easiest to debug, least likely to break after some simple modifications to Galaxy's configuration, and the easiest to understand for future admins.

I can add a setting for disabling this or make it only apply to 18.01+ configurations (if we do switch the configuration file to YAML this release) - so this doesn't have to get in the way of admins wanting to manually tweak things with env directives. It isn't a level of abstraction I want Galaxy admins to have to work at, and it certainly isn't a level of abstraction I want to have to reason about within Galaxy - but it is a level of abstraction I'd be happy to give you a flag to enable if you want to keep working that way.

nsoranzo commented 7 years ago

I don't have time now for a detailed response, but I just want to add that if the TMPDIR is set by the cluster admin on the node, this will be respected with my approach, while that is not possible with a variable set by a param in job_conf.xml .

jmchilton commented 7 years ago

I've made the case that I don't believe deferring to that should be the out of the box default but Ill provide a flag for instances that would like to do this.

nsoranzo commented 7 years ago

Question: there are some command-line tools (e.g ete3, tectonic) which cache downloaded data in hidden directories (~/.etetoolkit/, ~/.cache/Tectonic/). This is useful because:

I have two ideas which can help:

jmchilton commented 6 years ago

rename use_legacy_home to use_cached_home or something like that

Good point - this is entirely valid use case. I'll call it use_shared_home - I'll try to make it settable in galaxy's config or job_conf on a per destination basis - it should cover all the bases.