Specify conda environment via hdfs path

jcrist commented 7 years ago

Currently if you specify an environment it must exist locally as a .zip file, and is always stored in hdfs by knit in a .knitDeps repo. If a file of the same name/path already exists on hdfs, has the same size, and the timestamp is after the local file, then uploading of the environment is skipped.

This is problematic for us for a few reasons:

Managing local zipped conda environments is a delicate process
We'd like to store them in our own central location, with permissions managed across a few users
We'd like the code not to need a local build of the environment.
We'd like to manage environments separate from managing deployments. These are related concepts, but tying them together in that way is limiting.

As such, it would be nice if env could also be specied as an hdfs path. This could be detected by looking for a 'hdfs://' prefix on the specified path.

martindurant commented 7 years ago

Agreed, good idea.

martindurant commented 7 years ago

The tricky thing here is that the zipfile is assumed to contain an environment directory, since the line in the client env("PYTHON_BIN") = s"./PYTHON_DIR/$envName/bin/python" uses that name. The extra directory level isn't particularly useful, there will only ever be one. Not certain which way to go on this.

jcrist commented 7 years ago

I'd be for nixing the extra directory level, as you're right, there should only be one.

martindurant commented 7 years ago

In the meantime, I should probably remove all the string processing in scala - we'll pass in upload file (or none), hdfs location, and env variables for python. Note that the patch creation from env name, and setting of environments happens in two places in scala, both client and app-master.

jcrist commented 7 years ago

Oh, wait, I misunderstood you a bit here, thought the zipped environment contained the PYTHON_DIR/$envname/ bit.

For the record, the current behavior is:

Given a zip file named my_env.zip, this must unzip to a single directory my_env/, which is a conda environment. PYTHON_BIN is then set to ./PYTHON_DIR/my_env/bin/python.

Did you mean that the environment should unzip to many directories (e.g. bin, lib, etc...), and that PYTHON_BIN would be ./PYTHON_DIR/bin/python? In that case I'm not sure if that's a good idea. On the one hand, this would keep commands between environments the same, meaning the path to python would always be ./PYTHON_DIR/bin/python. On the other hand, this adds many things into the PYTHON_DIR directory.

An alternative solution would be to keep with the current convention, and take the basename of the provided hdfs path to be the environment name (same as is done on the local system). Everything continues to work the same then.

Of these two options I think I like the existing behavior better. Given a zipped conda environment named foo.zip I'd expect it to unzip to a directory foo/ which contains all the environment folders. However I can see arguments for the second, as it makes the folder structure consistent (no need to worry about the name of the environment). This could be supported in the existing design by just renaming the folder after unzipping. Something like:

$ unzip my_env.zip
$ mv my_env CONDA_ENVIRONMENT  # or some other consistent path

martindurant commented 7 years ago

Yes, I was thinking of having the python executable at './PYTHON_DIR/bin/python'.

Can we assume really that myenv.zip contains myenv/, and python lives within there? I suppose if we say so loudly in the doc, but we can't check this at run time. This is by far easier to implement given the current code, I'm not sure it's cleaner. Another possibility would be to pass the subdirectory name as yet another parameter...

jcrist commented 7 years ago

Can we assume really that myenv.zip contains myenv/, and python lives within there?

That's how the current code works, and it doesn't seem to be a problem currently. I'd be -1 on adding yet-another-parameter, so whatever the solution is shouldn't require that.

Another alternative would be to ditch the whole concept of conda environments and just have knit deal with

running a bash command
possibly unzipping a folder in the top-level environment before running that command

Something like:

knit = knit.Knit()
knit.start(cmd, ..., archive=path_to_zip)  # or some better kwarg

where path_to_zip is either a local path or a path on hdfs. On each node all the worker does is:

download the archive
unzip it in a working directory (or untar, etc... if we support other formats)
execute cmd in that directory

I think this is my favorite solution. No need for conda specific stuff, and what happens with the archive file seems clearer.

jcrist commented 7 years ago

Could also name the kwarg resources and have it take a list of paths. Then the files kwarg could be removed. Execution would then be:

Upload each resource to hdfs if not already an hdfs path
Download each resource to the workers in the order provided, unzipping any zipped files
Run the command on each worker

martindurant commented 7 years ago

Knit does allow you to run any bash command; and you can also upload files. There is no extension checking, so those files are assumed to be unzipped, but there could be. It would be easy enough to extend that to zip archives (I don't know what formats YARN would support). YARN does unzipping by itself, and we of course don't know what they contain beforehand; we could also run unzip via python or bash. Lots of possible modes.

jcrist commented 7 years ago

Knit does allow you to run any bash command; and you can also upload files.

I know that, the comment above was just an effort to reconcile the various options into a simpler (in my mind) and smaller set. Currently env is conda specific, but it doesn't need to be. A general list of (zip)files and a single command seems simpler. Conda specific stuff could be built on top of that, but the core isn't coupled to it.

Anyway, just brainstorming here. I only know my needs/wants, others may differ.

martindurant commented 7 years ago

Question: if we have a list of resources ['directory', 'file', 'file.zip', 'hdfs://file', 'hdfs://file.zip'], it's obvious which to zip up, which are already archives, and which are already on HDFS. Anything to be uploaded will go into .knitDeps/file where file is the basename. So, we need a separate variable to define the /bin directory where python and dask-worker executables are to be found. As we send the full command string, this can be done directly on the python side rather than setting PYTHON_BIN and conda env-variables, or we could just add the right dir to the container PATH. Any preferences?

jcrist commented 7 years ago

Hmmmm, in my mind knit shouldn't do anything python-specific - it's just running a bash command on a container given a set of resources. Two ideas:

Just pass the command string as you suggested
Also allow passing a mapping of environment variables to set. Then the user of knit could fully define their environment:

knit.start(cmd, resources=['directory', 'file',...], environ={'PYTHON': 'directory/bin/python',...})

I'm not sure which is better though.

martindurant commented 7 years ago

Fixed by #101

dask / knit

Specify conda environment via hdfs path #100