Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0
8.23k stars 773 forks

@conda_base to recycle identical conda envs #1120

Closed crypdick closed 2 months ago

crypdick commented 2 years ago

Current behavior:

@conda_base generates a separate conda env directory for each (flow, dependencies) combo.

Requested behavior:

Metaflow should recycle conda envs when their dependencies are identical.

Background:

We are transitioning our mono-repo from pip-tools to Metaflow's @conda_base. We wrote an environment.yml parser such that we can decorate all our flows with @conda_base(parse_env()) and reuse the same set of dependencies across all flows.
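For illustration, a parser along these lines could look like the sketch below. It is not the parser used in this thread: parse_env_text is a hypothetical name, and the sketch assumes an exported file whose dependencies are pinned as name=version[=build], skipping any nested pip: list. It avoids a YAML library only to stay self-contained.

```python
def parse_env_text(text):
    """Parse the text of an exported environment.yml (hypothetical helper).

    Returns {"python": version or None, "libraries": {name: version}}.
    Assumes conda-style pins such as "numpy=1.23.4=py39h_0".
    """
    libraries = {}
    python_version = None
    in_deps = False
    dep_indent = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        stripped = line.lstrip()
        if not stripped:
            continue
        indent = len(line) - len(stripped)
        if not stripped.startswith("- "):
            # A key such as "name:", "channels:" or "dependencies:".
            if indent == 0:
                in_deps = stripped == "dependencies:"
                dep_indent = None
            continue
        if not in_deps:
            continue  # list item of another section, e.g. channels
        if dep_indent is None:
            dep_indent = indent
        if indent != dep_indent:
            continue  # nested list, e.g. the entries under "- pip:"
        spec = stripped[2:].strip()
        if spec.endswith(":"):
            continue  # the "- pip:" marker itself
        name, _, rest = spec.partition("=")
        version = rest.split("=", 1)[0]  # discard the build string
        if name == "python":
            python_version = version
        else:
            libraries[name] = version
    return {"python": python_version, "libraries": libraries}
```

The result could then be passed along as @conda_base(libraries=env["libraries"], python=env["python"]), assuming every flow is meant to share the same pinned Python version.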

This works for single flow runs. However, our pipeline CI tests are broken because each flow generates a separate (identical) 7GB conda env, quickly filling up drives.

savingoyal commented 2 years ago

@crypdick conda uses hardlinks to save on disk space already. Are you seeing different behavior?

crypdick commented 2 years ago

@savingoyal I repeated the du commands from that link, and indeed, the two commands "for d in envs/*; do du -sh $d; done" and "du -sh envs/*" show different values, so it's not as bad as I thought.

Output of du -sh envs/* pkgs lib bin conda-meta share include etc: [image]

However, each env is still 600 MB. If these disk usages are correct, running our pytest suite locally for 20 flows still requires an unreasonable amount of space, IMO.
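The difference between the two du invocations comes from hardlink accounting: one du process counts a hardlinked file once, while a separate process per env counts it again each time. A minimal sketch of that accounting follows. The function names are hypothetical, and it sums apparent file sizes rather than allocated blocks, so the numbers will not match du exactly.

```python
import os


def tree_size(root, seen):
    """Sum file sizes under root, counting each (device, inode) pair once.

    `seen` is a set that may be shared across calls; files whose inode is
    already in it are skipped, which mimics how a single du invocation
    treats hardlinks.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue
            seen.add(key)
            total += st.st_size  # apparent size, not allocated blocks
    return total


def compare_env_sizes(envs_dir):
    """Return (per_env, combined) sizes in bytes for the env directories."""
    envs = sorted(
        entry.path for entry in os.scandir(envs_dir) if entry.is_dir()
    )
    # A fresh set per env: like running du once for each directory.
    per_env = {path: tree_size(path, set()) for path in envs}
    # One shared set: like a single du run over all directories.
    shared = set()
    combined = sum(tree_size(path, shared) for path in envs)
    return per_env, combined
```

If the combined figure is close to the sum of the per-env figures, the envs are sharing little or nothing through hardlinks.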

savingoyal commented 2 years ago

Do you have any other conda package cache besides pkgs? I don't see any reason why two different environments will have the same size (636M) and not rely on a cache.

crypdick commented 2 years ago

Not that I'm aware of, @savingoyal. I checked /opt/ for conda/mamba caches and didn't find anything there.

Additional info, this is how we generate environment.yml:

mamba env create -f src/environment.base.yml -n tmpenv python=3.9
mamba env export -n tmpenv >> src/environment.yml
mamba env remove --name tmpenv -y

The resulting environment.yml has pinned versions and builds, so I'm not surprised that each env has an identical size.

Update: I also compared the pkg directories with the Metaflow envs using ls -lLi to see whether the files point to the same inodes, and they appear to be different files on disk.
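The same inode comparison can be scripted rather than read off ls output. The sketch below uses hypothetical helper names and reports what fraction of an env's distinct files share an inode with the package cache. It also checks whether both directories sit on the same filesystem, since hard links cannot span filesystems and files would have to be copied in that case.

```python
import os


def inode_keys(root):
    """Collect (device, inode) pairs for files under root.

    Uses os.stat, which follows symlinks in the same way `ls -L` does.
    """
    keys = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # broken symlink or unreadable file
            keys.add((st.st_dev, st.st_ino))
    return keys


def shared_fraction(env_dir, pkgs_dir):
    """Fraction of distinct inodes in env_dir that also appear in pkgs_dir."""
    env_keys = inode_keys(env_dir)
    if not env_keys:
        return 0.0
    return len(env_keys & inode_keys(pkgs_dir)) / len(env_keys)


def same_filesystem(path_a, path_b):
    """True when both paths live on the same device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

A fraction near zero on a shared filesystem would be consistent with the files having been copied rather than linked.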

savingoyal commented 2 years ago

@crypdick you can invoke conda info and mamba info to list all your package caches.
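To collect that list programmatically, one option is to read the JSON form of the same command. This is a sketch under the assumption that conda info --json exposes the cache list under a "pkgs_dirs" key. The function names are hypothetical, and the mamba equivalent is not covered here.

```python
import json
import subprocess


def package_caches(info_json):
    """Extract the package cache directories from `conda info --json` text.

    Assumes the list is stored under the "pkgs_dirs" key; returns an empty
    list when that key is absent.
    """
    info = json.loads(info_json)
    return list(info.get("pkgs_dirs", []))


def conda_package_caches(executable="conda"):
    """Run `<executable> info --json` and return its package cache list."""
    result = subprocess.run(
        [executable, "info", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return package_caches(result.stdout)
```

Calling conda_package_caches() requires conda on the PATH, while package_caches can be applied to output that was saved earlier.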

crypdick commented 2 years ago

Here you go @savingoyal: https://gist.github.com/crypdick/106c876a8af1f0403c8dce50b545eaef

savingoyal commented 2 months ago

@crypdick, please feel free to reopen this issue if it still persists.