anaconda / anaconda-project

Tool for encapsulating, running, and reproducing data science projects
https://anaconda-project.readthedocs.io/en/latest/
Other

pack envs into the archive #313

Closed AlbertDeFusco closed 3 years ago

AlbertDeFusco commented 3 years ago
usage: anaconda-project archive [-h] [--directory PROJECT_DIR] [--pack-envs]
                                ARCHIVE_FILENAME

positional arguments:
  ARCHIVE_FILENAME

optional arguments:
  -h, --help            show this help message and exit
  --directory PROJECT_DIR
                        Project directory containing anaconda-project.yml
                        (defaults to current directory)
  --pack-envs           Package environments into the archive.

With the --pack-envs flag all live env_specs will be packed with conda-pack and added to the output archive.

When the archive is extracted using anaconda-project unarchive the conda-pack envs are also extracted and the relocatable paths are fixed.

AlbertDeFusco commented 3 years ago

@bkreider , @jbednar , @mcg1969 would anyone care to review this?

jlstevens commented 3 years ago

Looks great!

To test, I decided to try out this small project here.

When there is no envs directory, I note that --pack-envs doesn't complain (wondering if it should at least warn):

image

But after running anaconda-project prepare you can see the env being packed:

image

This suggests that envs is packed 'as is' (if present), which does make sense as a default. That said, for reproducibility (from the anaconda-project.yml specification) it might also make sense to have the option to regenerate envs (i.e., rerun anaconda-project prepare) before archiving the result?

After that I simply tried unpacking test2.tar.gz expecting to see the project with the envs directory, but found osx-64_envs_default.tar.bz2 instead. That is when I came back to this PR to see that I had to use anaconda-project unarchive test2.tar.gz after which it worked as expected.

You wrote that:

> When the archive is extracted using anaconda-project unarchive the conda-pack envs are also extracted and the relocatable paths are fixed.

So I am assuming osx-64_envs_default.tar.bz2 is really the result of conda-pack which the unarchive command handles, explaining why I can't just use OSX to extract the archive (and immediately start using it).

Is there a reason this extra step of indirection couldn't be avoided, removing the need for the unarchive command? I am wondering what relocatable paths need adjusting as I was under the impression that the envs directory was already relocatable?

jbednar commented 3 years ago

> This suggests that envs is packed 'as is' (if present) which does make sense though for reproducibility it might also make sense to have the option to regenerate envs (i.e., rerun anaconda-project prepare) before archiving the result?

That's surprising to me; I would have expected that the environment would first be generated if not already present, and updated if already present, the same as currently happens for anaconda-project run. That way we can be sure it's archiving a coherent and consistent project.

I too am surprised that unpacking the archive requires a special command. Maybe instead anaconda-project can look for the packed environment and unpack it if it finds it, before doing anything else? That way it will act as if the environment has been unpacked already. I think most people expect to be able to unpack the archive manually without any consequences, if only because often that's the only way to get to a README that tells them to use anaconda-project in the first place.

mcg1969 commented 3 years ago

conda-pack archives cannot simply be unpacked with standard OS tools. There is relocation logic that must be executed in order for the environment to be fully functional.

mcg1969 commented 3 years ago

(To be fair, using a standard OS tool followed by running envs/<envname>/bin/conda-unpack will do the trick. You cannot omit that second step.)
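
For readers who want to see the two manual steps together, here is a minimal Python sketch; it is not anaconda-project's actual implementation. It assumes a POSIX layout in which conda-pack places the script at bin/conda-unpack inside the environment. The function name extract_and_unpack and the injectable runner parameter are hypothetical and exist only for illustration.

```python
import subprocess
import tarfile
from pathlib import Path


def extract_and_unpack(archive: Path, env_dir: Path, runner=subprocess.run) -> bool:
    """Extract a conda-pack tarball into env_dir, then run conda-unpack.

    Returns True if a conda-unpack script was found and executed,
    False if the archive did not contain one.
    """
    # Step 1: plain extraction, equivalent to using a standard OS tool.
    env_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive) as tar:
        tar.extractall(env_dir)

    # Step 2: the relocation step that cannot be omitted.
    # Assumption: POSIX layout (bin/conda-unpack inside the environment).
    script = env_dir / "bin" / "conda-unpack"
    if not script.exists():
        return False
    runner([str(script)], check=True)
    return True
```

A call such as extract_and_unpack(Path("osx-64_envs_default.tar.bz2"), Path("envs/default")) would mirror the manual procedure described above; the target path here is only an example.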

AlbertDeFusco commented 3 years ago

How would this sound?

  1. Add anaconda-project archive --pack-envs --rebuild-envs, which will run anaconda-project clean and then anaconda-project prepare --all before building the archive
  2. Move the conda-unpack logic from unarchive into run.

How strongly do you expect that extracting a project archive should create envs/ without needing to run an anaconda-project command?

The reason for having this feature in unarchive is to detect when the platform on which the project was archived does not match the platform where it was extracted.

AlbertDeFusco commented 3 years ago

Yes, running the conda-unpack script is necessary to fix the relocation paths, and it can be run manually.

jbednar commented 3 years ago

--rebuild-envs sounds good, but as an anaconda-project user I've been accustomed to expecting envs to be rebuilt automatically if needed, without an explicit command, and finding that a certain command like --pack-envs behaves otherwise seems confusing and surprising. So from the behavior for other options, I'd expect an option for not rebuilding the environment, not one that's required for rebuilding it.

> The reason for having this feature in unarchive is to detect when the platform on which the project was archived does not match the platform where it was extracted.

I'd assume that can be detected regardless, e.g. by storing information somewhere about which platform was archived.

mcg1969 commented 3 years ago

If we want to entertain the idea of moving the conda-unpack execution into the prepare step, then I would want us to have some sort of indicator file (e.g., $PREFIX/.conda-packed) that we would insert into the environment during the packing process, so that we know to run conda-unpack only once. Or maybe we remove or zero out conda-unpack once it has been run once.
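
The indicator-file idea can be sketched roughly as follows. This is only an illustration of the "run conda-unpack once" logic under stated assumptions, not code from anaconda-project or conda-pack: the marker name .conda-packed is taken from the suggestion above, while unpack_once and the run_unpack callback are hypothetical names.

```python
from pathlib import Path
from typing import Callable

# Marker name from the suggestion above; nothing writes this file today.
MARKER_NAME = ".conda-packed"


def unpack_once(prefix: Path, run_unpack: Callable[[Path], None]) -> bool:
    """Run the unpack step only if the marker file is still present.

    Returns True if the unpack step ran, False if it was skipped.
    """
    marker = prefix / MARKER_NAME
    if not marker.exists():
        # Either the env was never packed or it was already unpacked.
        return False
    run_unpack(prefix)
    # Remove the marker only after the unpack step succeeded, so a
    # failed attempt is retried on the next call.
    marker.unlink()
    return True
```

Removing the marker after a successful run is one possible design; zeroing out the conda-unpack script, as also suggested above, would achieve the same effect without an extra file.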

AlbertDeFusco commented 3 years ago

Ok, so here's how I understand the proposed changes. Have I got it right?

  1. Enforce the prepare --all action when running archive --pack-envs
  2. Build only one archive, not nested archives, that populates the envs/ directory and can be extracted with os-native tools
    • and add a file to indicate the architecture on which the envs were packed
  3. prepare will detect the presence of conda-unpack in the envs/ directory and the architecture file to determine that this was created using --pack-envs
    • if the conda-unpack script is present, determine if the architecture matches
    • if the env is not compatible, prepare will remove the envs directory and rebuild them from scratch
    • if the env is compatible, run conda-unpack and remove it or set it to return 0 for the next time prepare is attempted

I'll get working on these changes
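
The decision flow proposed for prepare might look something like the sketch below. It is a hedged illustration only, not the merged implementation: the file name packed-platform.json, its JSON layout, the platform string format, and the function names are all assumptions made for this example. Treating a missing platform file as "rebuild" is likewise an assumption, chosen as the conservative option.

```python
import json
import platform
import sys
from pathlib import Path

# Hypothetical name for the file recording where the env was packed.
PLATFORM_FILE = "packed-platform.json"


def current_platform() -> str:
    """Return a simple OS/architecture identifier (format is an assumption)."""
    return f"{sys.platform}-{platform.machine()}"


def decide_prepare_action(env_prefix: Path, this_platform: str) -> str:
    """Return 'normal-prepare', 'unpack', or 'rebuild' for one environment."""
    unpack_script = env_prefix / "bin" / "conda-unpack"
    if not unpack_script.exists():
        # No conda-unpack script: not a freshly extracted packed env.
        return "normal-prepare"

    info_file = env_prefix / PLATFORM_FILE
    if not info_file.exists():
        # Assumption: without platform information, rebuild to be safe.
        return "rebuild"

    packed_on = json.loads(info_file.read_text()).get("platform")
    # Compatible envs only need conda-unpack; others are rebuilt from scratch.
    return "unpack" if packed_on == this_platform else "rebuild"
```

In this sketch, a caller would pass current_platform() as the second argument and then either run conda-unpack, delete and rebuild the environment, or continue with the usual prepare logic.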

jbednar commented 3 years ago

That sounds perfect! It seems like that will make a packed archive act just like a regular one, but without having to fetch packages and build the environment. Great!

jlstevens commented 3 years ago

Just to report that I tested this feature with conda install -c defusco/label/dev anaconda-project=0.9.1+91. To do this, I chose this project randomly from the pyviz example projects.

Running anaconda-project archive /tmp/archive.tar.gz --pack-envs worked as expected. It does take a while to pack but the progress bars really help indicate that things were advancing. Note that I ran this without the local envs directory and the solve was triggered as expected.

After extracting the archive, anaconda-project run notebook immediately started the notebook server with the notebook immediately runnable. I think if --pack-envs were quicker and the archives a lot smaller, I would use this option all the time: in short, I think the usability is now greatly improved and works exactly as I would have hoped!

mcg1969 commented 3 years ago

This is fantastic.

jlstevens commented 3 years ago

Very excited to see this merged!

One thing that I wanted to know was the archive size for some simple cases to get a baseline:

It would be interesting to find out what the smallest archive size can be that offers a basic data science environment.

AlbertDeFusco commented 3 years ago

One thing I'll look into a bit more is ignoring .pyc or other files and static libraries.
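
As a rough idea of what skipping such files could look like, here is a small sketch using only the Python standard library. It does not use conda-pack's own API and is not how anaconda-project builds its archives; the suffix list and the names skip_unwanted and archive_env are hypothetical. Note that dropping static libraries could break workflows that compile against the environment, so this is a trade-off rather than a free win.

```python
import tarfile
from pathlib import Path
from typing import Optional

# Assumed set of suffixes to leave out: bytecode caches and static libraries.
SKIP_SUFFIXES = (".pyc", ".a")


def skip_unwanted(member: tarfile.TarInfo) -> Optional[tarfile.TarInfo]:
    """tarfile filter: returning None excludes the member from the archive."""
    if member.isfile() and member.name.endswith(SKIP_SUFFIXES):
        return None
    return member


def archive_env(env_dir: Path, archive: Path) -> None:
    """Write env_dir into a gzip-compressed tarball, skipping unwanted files."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(env_dir, arcname=env_dir.name, filter=skip_unwanted)
```

Python regenerates .pyc files on import when the directories are writable, so leaving them out mainly costs some start-up time after extraction.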

AlbertDeFusco commented 3 years ago

It is surprising that nomkl leads to that large of an increase!