anaconda / anaconda-project

Tool for encapsulating, running, and reproducing data science projects
https://anaconda-project.readthedocs.io/en/latest/

read-only environments #270

Closed AlbertDeFusco closed 3 years ago

AlbertDeFusco commented 4 years ago

What if a pre-baked env needs to be modified but is read-only on disk?

AlbertDeFusco commented 4 years ago

@mcg1969, I've been running some experiments on this and have developed two approaches, which I explain here using conda commands. I believe either one can be incorporated into anaconda-project. A comparison between the two approaches is given at the end.

These two solutions produce identical environments and have nearly identical times for this example.

Setup

Let's create a read-only conda environment.

# create and unpin history
conda create -y --offline -n readme python=3.6 anaconda=5.0.1
cat /dev/null > ~/Applications/miniconda3/envs/readme/conda-meta/history

# set as readonly
chmod -R 555 ~/Applications/miniconda3/envs/readme

Clone

One approach is to utilize conda create --clone only when anaconda-project detects that a change has been requested to the read-only env and then continue to apply package changes. Otherwise, anaconda-project commands like run will work directly on the read-only env.

conda create -p ./envs/clone --clone ~/Applications/miniconda3/envs/readme
cat /dev/null > ./envs/clone/conda-meta/history
conda install --offline -y -p ./envs/clone pandas=1 hvplot
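
How anaconda-project would detect that an env is read-only is not spelled out above. A minimal sketch, assuming that write access to the env's conda-meta directory is a good enough proxy (the helper name env_is_writable is hypothetical and not part of anaconda-project):

```python
import os


def env_is_writable(prefix):
    """Return True if the conda env at `prefix` looks modifiable.

    Assumption: conda has to write to <prefix>/conda-meta to record a
    package change, so write access there is used as the proxy.
    """
    conda_meta = os.path.join(prefix, "conda-meta")
    # Creating entries in a directory needs both write and search
    # permission, hence W_OK | X_OK.
    return os.path.isdir(conda_meta) and os.access(conda_meta, os.W_OK | os.X_OK)
```

Note that a privileged user such as root generally passes this check even after chmod -R 555, so a real implementation may need a stricter test.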

Rebuild spec and create

An alternative is, again only when a change to the env is requested, to perform a dry-run install of the requested packages and rebuild an environment spec file using the original package list from the read-only env.

First prepare JSON files for 1) the original package list and 2) the changes required (both add and remove).

conda list --json -n readme > readme.json
conda install --offline -n readme --dry-run --json pandas=1 hvplot > update.json

Now we need to reconstruct the environment using these two JSON files.

import json
import sys

readonly_json = sys.argv[1]
dryrun_json = sys.argv[2]

def pkg_version_build(d):
    return f"{d['name']}={d['version']}={d['build_string']}"

with open(readonly_json) as f:
    readonly = json.load(f)

with open(dryrun_json) as f:
    dryrun = json.load(f)

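original = set(map(pkg_version_build, readonly))

# UNLINK (or LINK) may be absent when nothing is removed (or added)
actions = dryrun.get('actions', {})
to_remove = set(map(pkg_version_build, actions.get('UNLINK', [])))
to_add = set(map(pkg_version_build, actions.get('LINK', [])))

final = (original - to_remove) | to_add

# sort so the generated spec is stable between runs
print('dependencies:')
print('\n'.join(f'  - {p}' for p in sorted(final)))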

And now create the local environment

python rebuild_spec.py readme.json update.json > local.yml
conda env create -f local.yml -p ./envs/local
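
To make the set arithmetic in the rebuild step concrete, here is a small worked example on in-memory data. The package records are made up (versions and build strings are illustrative only) and contain just the fields the script reads; the function name rebuild_dependencies is hypothetical.

```python
def pkg_version_build(d):
    return f"{d['name']}={d['version']}={d['build_string']}"


def rebuild_dependencies(installed, actions):
    """Apply a dry-run's UNLINK/LINK actions to an installed package list."""
    original = set(map(pkg_version_build, installed))
    to_remove = set(map(pkg_version_build, actions.get("UNLINK", [])))
    to_add = set(map(pkg_version_build, actions.get("LINK", [])))
    return sorted((original - to_remove) | to_add)


# Made-up records, shaped like the relevant fields of the conda JSON output.
installed = [
    {"name": "python", "version": "3.6.3", "build_string": "h0"},
    {"name": "pandas", "version": "0.20.3", "build_string": "py36_0"},
]
actions = {
    "UNLINK": [{"name": "pandas", "version": "0.20.3", "build_string": "py36_0"}],
    "LINK": [
        {"name": "pandas", "version": "1.0.0", "build_string": "py36_0"},
        {"name": "hvplot", "version": "0.5.2", "build_string": "py_0"},
    ],
}

deps = rebuild_dependencies(installed, actions)
print(deps)
```

The old pandas build drops out, the new pandas and hvplot builds are added, and python is carried over unchanged because the dry run did not touch it.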

Comparison

| Method | Time (min) | Notes |
| --- | --- | --- |
| Clone | 1.13 | There may be unintended consequences with clone (does it work with pip?). |
| Rebuild | 1.1 | Requires two solves, however the second one should be much faster. The time to solution could be slower if the original package cache from the read-only env is not maintained. |

mcg1969 commented 4 years ago

That's actually not bad. Thanks for being data driven.

The package cache requirement is the same for both—cloning actually requires repopulating the package cache.

mcg1969 commented 4 years ago

So the advantage for the Rebuild approach is that it reduces the amount of package downloads. With a clone and install, you'll repopulate the package cache with old packages that don't end up in the new environment.

AlbertDeFusco commented 3 years ago

With #292 merged I believe we can support this directly in the path function. The find_environment_deviations can help.

https://github.com/Anaconda-Platform/anaconda-project/blob/84f20e77af7bdac53b9948209c8fbfd83bedfe41/anaconda_project/conda_manager.py#L88

What I'm working on in the path function is the following:

  1. search for first matching env_spec in $ANACONDA_PROJECT_ENVS_PATH directories
  2. if no changes are required set this path and continue with the operation
  3. if changes are required continue searching $ANACONDA_PROJECT_ENVS_PATH directories for a writable env_spec
  4. if no pre-made env_spec found create new env in first writable $ANACONDA_PROJECT_ENVS_PATH path
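
The four steps above can be sketched as follows. This is only an illustration, not the code from the actual path function: choose_env_path is a hypothetical name, and needs_changes and is_writable are caller-supplied predicates standing in for the deviation check and the permission check.

```python
import os


def choose_env_path(env_name, envs_dirs, needs_changes, is_writable):
    """Pick the prefix for `env_name` following the four steps above.

    Returns (prefix, must_create). `needs_changes(prefix)` and
    `is_writable(path)` are hypothetical predicates supplied by the caller.
    """
    candidates = [os.path.join(d, env_name) for d in envs_dirs]
    existing = [p for p in candidates if os.path.isdir(p)]

    # Steps 1 and 2: the first matching env wins if nothing has to change.
    if existing and not needs_changes(existing[0]):
        return existing[0], False

    # Step 3: otherwise look for an existing env that can be modified
    # (this includes the first match, if it happens to be writable).
    for prefix in existing:
        if is_writable(prefix):
            return prefix, False

    # Step 4: no usable env; create one in the first writable directory.
    for d in envs_dirs:
        if is_writable(d):
            return os.path.join(d, env_name), True

    raise RuntimeError("no writable directory in the search path")
```

Passing the checks in as predicates keeps the search order separate from how "needs changes" and "writable" are actually determined.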

So this would mean that you run anaconda-project as follows

export ANACONDA_PROJECT_ENVS_PATH=/path/to/readonly/envs:/path/to/writable/envs:
anaconda-project prepare

Does this match what you want to do?

mcg1969 commented 3 years ago

Yes, I think that sounds right! Except that I think you have to put the writable environment directories first in priority, not last.

After all, suppose you do anaconda-project prepare and you determine that changes need to be made to a read-only environment. So you create a new environment to host the changes.

But the way you've ordered the PATH, the next time you do anaconda-project prepare it won't see that read-write environment.
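
The ordering concern can be illustrated with a small sketch. It assumes a plain first-match lookup along the search path, which may differ from what anaconda-project ends up doing; first_match is a hypothetical helper, and the directories are temporary stand-ins for the read-only and writable env locations.

```python
import os
import tempfile


def first_match(envs_path, env_name):
    """Return the first existing <dir>/<env_name> along envs_path, or None."""
    for d in envs_path.split(os.pathsep):
        # This sketch ignores empty entries (e.g. from a trailing separator).
        if d and os.path.isdir(os.path.join(d, env_name)):
            return os.path.join(d, env_name)
    return None


with tempfile.TemporaryDirectory() as root:
    readonly = os.path.join(root, "readonly")
    writable = os.path.join(root, "writable")
    # Both directories hold an env named "default": the pre-baked one and
    # the modified copy created by an earlier prepare.
    os.makedirs(os.path.join(readonly, "default"))
    os.makedirs(os.path.join(writable, "default"))

    readonly_first = os.pathsep.join([readonly, writable])
    writable_first = os.pathsep.join([writable, readonly])

    found_readonly_first = first_match(readonly_first, "default")
    found_writable_first = first_match(writable_first, "default")
    print(os.path.relpath(found_readonly_first, root))
    print(os.path.relpath(found_writable_first, root))
```

With the read-only directory listed first, the lookup returns the pre-baked env again; with the writable directory first, it returns the modified copy.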

AlbertDeFusco commented 3 years ago

looks done to me.