crypdick closed this issue 2 months ago
@crypdick conda uses hardlinks to save on disk space already. Are you seeing different behavior?
@savingoyal I repeated the `du` commands from that link, and indeed, the two commands `for d in envs/*; do du -sh $d; done` vs `du -sh envs/*` show different values, so it's not as bad as I thought.
`du -sh envs/* pkgs lib bin conda-meta share include etc`:
However, each env is still 600 MB. If these disk usages are correct, running our pytest suite locally for 20 flows still requires an unreasonable amount of space, IMO.
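The two `du` invocations disagree because a single `du` run counts a hardlinked file only once, while a fresh `du` per directory counts it again each time. A minimal Python sketch of the same accounting follows; `unique_disk_usage` is a hypothetical helper, not part of conda or Metaflow, and it sums apparent file sizes rather than allocated blocks, so its totals will not match `du` exactly.

```python
import os


def unique_disk_usage(roots):
    """Sum file sizes under `roots`, counting each (device, inode) pair once.

    Hypothetical helper for illustration. Hardlinked copies of the same
    file share an inode, so they contribute to the total only one time.
    """
    seen = set()
    total = 0
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                # lstat: do not follow symlinks, measure the entry itself
                st = os.lstat(os.path.join(dirpath, name))
                key = (st.st_dev, st.st_ino)
                if key in seen:
                    continue
                seen.add(key)
                total += st.st_size
    return total
```

Calling it once per env directory and once with all env directories together gives a rough estimate of how much space the envs actually share.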
Do you have any other conda package cache besides `pkgs`? I don't see any reason why two different environments would have the same size (636M) and not rely on a cache.
Not that I'm aware of, @savingoyal. I checked `/opt/` for conda/mamba caches and didn't find anything there.
Additional info:
`mamba` solver (cmds simplified for brevity):
mamba env create -f src/environment.base.yml -n tmpenv python=3.9
mamba env export -n tmpenv >> src/environment.yml
mamba env remove --name tmpenv -y
The resulting `environment.yml` has pinned versions and builds, so I'm not surprised that each env has an identical size.
update:
I also poked around the pkg directories vs the metaflow envs, running `ls -lLi` to see if the files are pointing to the same inodes, and they appear to be different files on disk.
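The inode comparison done by hand with `ls -lLi` could be automated roughly as below. This is only a sketch: `hardlink_stats` and `shares_inode` are hypothetical helper names, and the paths passed in would be an env directory and a file in the package cache.

```python
import os


def hardlink_stats(env_dir):
    """Return (linked, total) counts of regular files under `env_dir`.

    A file counts as "linked" when its link count is above 1, meaning at
    least one other directory entry points at the same inode.
    """
    linked = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(env_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symlinks are not hardlinks; skip them
            st = os.stat(path)
            total += 1
            if st.st_nlink > 1:
                linked += 1
    return linked, total


def shares_inode(path_a, path_b):
    """True if both paths resolve to the same file (same device and inode)."""
    return os.path.samefile(path_a, path_b)
```

If `hardlink_stats` reports close to zero linked files for an env, the packages were most likely copied rather than hardlinked from the cache.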
@crypdick you can invoke `conda info` and `mamba info` to list all your package caches.
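For scripting the same check, `conda info --json` reports the cache locations under the `pkgs_dirs` key. The sketch below assumes the chosen tool is on `PATH` and that `mamba` accepts the same `info --json` form and emits the same key, which may not hold for every mamba version; both function names are hypothetical.

```python
import json
import subprocess


def pkgs_dirs_from_info(info_json):
    """Extract the package cache directories from `info --json` output."""
    return list(json.loads(info_json).get("pkgs_dirs", []))


def list_package_caches(tool="conda"):
    """Run `<tool> info --json` and return its package cache directories.

    Assumption: `tool` is installed and supports `info --json`.
    """
    result = subprocess.run(
        [tool, "info", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return pkgs_dirs_from_info(result.stdout)
```

Comparing `list_package_caches("conda")` with `list_package_caches("mamba")` would show whether the two tools are configured with different caches.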
Here you go @savingoyal: https://gist.github.com/crypdick/106c876a8af1f0403c8dce50b545eaef
@crypdick, please feel free to reopen this issue if it still persists.
Current behavior:
@conda_base generates a separate conda env directory for each (flow, dependencies) combo.
Requested behavior:
Metaflow should recycle conda envs if they have identical dependencies.
Background:
We are transitioning our mono-repo from pip-tools to Metaflow's @conda_base. We wrote an environment.yml parser such that we can decorate all our flows with `@conda_base(parse_env())` and reuse the same set of dependencies across all flows. This works for single flow runs. However, our pipeline CI tests are broken because each flow generates a separate (identical) 7GB conda env, quickly filling up drives.
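One way the requested recycling could work is to key each env on its dependency set alone, so that two flows with identical dependencies resolve to the same directory. The sketch below only illustrates that idea and is not Metaflow's actual implementation; `env_key` and its parameters are hypothetical.

```python
import hashlib
import json


def env_key(libraries, python="3.9"):
    """Derive a stable identifier from a dependency set.

    Hypothetical helper: `libraries` maps package name to a pinned version
    string. The flow name is deliberately left out, so flows that share
    dependencies share a key.
    """
    payload = json.dumps(
        {"python": python, "libraries": sorted(libraries.items())},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

With such a key, an env would be created only when no directory for that key exists yet, regardless of which flow requested it.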