Agreed, good idea.
The tricky thing here is that the zipfile is assumed to contain an environment directory, since the line in the client, `env("PYTHON_BIN") = s"./PYTHON_DIR/$envName/bin/python"`, uses that name. The extra directory level isn't particularly useful, since there will only ever be one environment per zip. Not certain which way to go on this.
I'd be for nixing the extra directory level, as you're right, there should only be one.
In the meantime, I should probably remove all the string processing in Scala - we'll pass in the upload file (or none), the HDFS location, and the environment variables for Python. Note that the path construction from the env name, and the setting of environment variables, happens in two places in Scala: both the client and the application master.
Oh, wait, I misunderstood you a bit here - I thought the zipped environment contained the `PYTHON_DIR/$envname/` bit.
For the record, the current behavior is: given a zip file named `my_env.zip`, this must unzip to a single directory `my_env/`, which is a conda environment. `PYTHON_BIN` is then set to `./PYTHON_DIR/my_env/bin/python`.
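To make that concrete, here's a rough sketch of the convention (plain Python for illustration only; the real logic lives in the Scala client, and the variable names below are made up):

```python
import os

env_zip = "my_env.zip"                                       # the provided environment archive
env_name = os.path.splitext(os.path.basename(env_zip))[0]    # -> "my_env"
python_bin = "./PYTHON_DIR/{}/bin/python".format(env_name)   # -> "./PYTHON_DIR/my_env/bin/python"
```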
Did you mean that the environment should unzip to many directories (e.g. `bin`, `lib`, etc...), and that `PYTHON_BIN` would be `./PYTHON_DIR/bin/python`? In that case I'm not sure that's a good idea. On the one hand, this would keep commands between environments the same, meaning the path to python would always be `./PYTHON_DIR/bin/python`. On the other hand, this adds many things into the `PYTHON_DIR` directory.
An alternative solution would be to keep the current convention, and take the basename of the provided HDFS path as the environment name (same as is done for local paths). Everything then continues to work the same.
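Sketching that alternative (again just illustrative Python; the HDFS path is made up and this isn't knit's actual code):

```python
import posixpath

hdfs_env = "hdfs:///user/someone/envs/my_env.zip"                # hypothetical HDFS path
env_name = posixpath.splitext(posixpath.basename(hdfs_env))[0]   # -> "my_env"
python_bin = "./PYTHON_DIR/{}/bin/python".format(env_name)       # same convention as today
```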
Of these two options I think I like the existing behavior better. Given a zipped conda environment named `foo.zip` I'd expect it to unzip to a directory `foo/` which contains all the environment folders. However, I can see arguments for the second, as it makes the folder structure consistent (no need to worry about the name of the environment). This could be supported in the existing design by just renaming the folder after unzipping. Something like:

```
$ unzip my_env.zip
$ mv my_env CONDA_ENVIRONMENT   # or some other consistent path
```
Yes, I was thinking of having the python executable at `./PYTHON_DIR/bin/python`.
Can we really assume that myenv.zip contains myenv/, and that python lives within there? I suppose we could if we say so loudly in the docs, but we can't check this at run time. This is by far the easier to implement given the current code, but I'm not sure it's cleaner. Another possibility would be to pass the subdirectory name as yet another parameter...
> Can we really assume that myenv.zip contains myenv/, and that python lives within there?
That's how the current code works, and it doesn't seem to be a problem currently. I'd be -1 on adding yet-another-parameter, so whatever the solution is shouldn't require that.
Another alternative would be to ditch the whole concept of conda environments and just have knit deal with generic zip archives. Something like:
```python
knit = knit.Knit()
knit.start(cmd, ..., archive=path_to_zip)  # or some better kwarg
```
where `path_to_zip` is either a local path or a path on HDFS. On each node, all the worker does is unpack the archive and then run `cmd` in that directory.
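Roughly, as a sketch (the unpacking step and all the names below are my own illustration, not knit's actual implementation):

```python
import subprocess
import zipfile

def run_in_archive(archive_path, cmd):
    # unpack the shipped archive into the container's working area...
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall("archive_dir")
    # ...then just run the user-supplied command from inside it
    subprocess.check_call(cmd, shell=True, cwd="archive_dir")
```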
I think this is my favorite solution. No need for conda-specific stuff, and what happens with the archive file seems clearer.
Could also name the kwarg `resources` and have it take a list of paths. Then the `files` kwarg could be removed. Execution would then be:
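For example (just sketching the proposed signature; the paths are made up and this is not knit's current API):

```python
import knit

k = knit.Knit()
k.start("env/bin/python myscript.py",
        resources=["env.zip", "myscript.py", "hdfs:///user/someone/data.zip"])
```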
Knit does allow you to run any bash command, and you can also upload files. There is no extension checking, so those files are assumed to not need unzipping, but there could be. It would be easy enough to extend that to zip archives (I don't know which formats YARN would support). YARN does unzipping by itself, and of course we don't know what the archives contain beforehand; we could also run unzip via python or bash. Lots of possible modes.
> Knit does allow you to run any bash command, and you can also upload files.
I know that - the comment above was just an effort to reconcile the various options into a simpler (in my mind) and smaller set. Currently `env` is conda-specific, but it doesn't need to be. A general list of (zip)files and a single command seems simpler. Conda-specific stuff could be built on top of that, but the core isn't coupled to it.
Anyway, just brainstorming here. I only know my needs/wants, others may differ.
Question: if we have a list of resources `['directory', 'file', 'file.zip', 'hdfs://file', 'hdfs://file.zip']`, it's obvious which need to be zipped up, which are already archives, and which are already on HDFS. Anything to be uploaded will go into `.knitDeps/file`, where `file` is the basename. So we need a separate variable to define the `/bin` directory where the python and dask-worker executables are to be found. Since we send the full command string, this can be done directly on the python side rather than setting `PYTHON_BIN` and conda env variables, or we could just add the right directory to the container `PATH`.
Any preferences?
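For concreteness, a rough sketch of the sorting-out I mean (the function and the rules below are purely illustrative, not a proposed implementation):

```python
import os

def classify(resource):
    """Illustrative only: decide what to do with one entry of a resources list."""
    if resource.startswith("hdfs://"):
        return "already on HDFS - reference it directly"
    if os.path.isdir(resource):
        return "local directory - zip it up, then upload to .knitDeps/"
    if resource.endswith(".zip"):
        return "local archive - upload to .knitDeps/" + os.path.basename(resource)
    return "plain local file - upload to .knitDeps/" + os.path.basename(resource)
```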
Hmmmm, in my mind `knit` shouldn't do anything python-specific - it's just running a bash command on a container given a set of resources. Two ideas:
```python
knit.start(cmd, resources=['directory', 'file', ...], environ={'PYTHON': 'directory/bin/python', ...})
```
I'm not sure which is better though.
Fixed by #101
Currently if you specify an environment it must exist locally as a `.zip` file, and it is always stored in HDFS by knit in a `.knitDeps` repo. If a file of the same name/path already exists on HDFS, has the same size, and has a timestamp after the local file, then uploading of the environment is skipped. This is problematic for us for a few reasons:
As such, it would be nice if `env` could also be specified as an HDFS path. This could be detected by looking for an `'hdfs://'` prefix on the specified path.
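A minimal sketch of that detection (the helper name is made up, not part of knit):

```python
def is_hdfs_path(path):
    # detect an HDFS location by its scheme prefix, as suggested above
    return path.startswith("hdfs://")

is_hdfs_path("hdfs:///user/someone/my_env.zip")  # True  -> use in place, skip upload
is_hdfs_path("./my_env.zip")                     # False -> upload to .knitDeps as today
```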