iirekm opened 4 years ago

Currently, Trains seems to recreate/reset virtualenv before every task run, which means that packages are reinstalled again and again. On huge GPU instances this is a problem: every minute costs, so the quicker my task starts, the better.
Hi @iirekm,
Reusing a virtualenv poses many challenges when it comes to reproducing the execution. An alternative is using the Agent's package_manager.system_site_packages setting on a system where most requirements are preinstalled, or using a base docker image in which most requirements are already installed.
What do you think?
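For reference, that setting lives in the agent's package_manager section of ~/trains.conf on the agent machine; a sketch of what it looks like (exact key names may vary between versions, double-check your config file):

agent {
    package_manager: {
        # reuse the system-wide site-packages instead of installing everything into the per-task venv
        system_site_packages: true,
    }
}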
Those settings with packages or docker image are per agent, not per run? That will be a problem when I have (let's say) 2 agents and 3 projects. Would I have to restart my agents with new settings when switching projects?
I think caching envs would be easy: just keep many envs in different directories, whose names could end with a hash of setup.py or requirements.txt.
Plus: an even worse thing happens when I schedule a task to the default services queue:
The following additional packages will be installed:
ca-certificates git-man krb5-locales less libasn1-8-heimdal libbsd0
libcurl3-gnutls libedit2 liberror-perl libexpat1 libgdbm-compat4 libgdbm5
libglib2.0-data libgssapi-krb5-2 libgssapi3-heimdal libhcrypto4-heimdal
libheimbase1-heimdal libheimntlm0-heimdal libhx509-5-heimdal libice6
libicu60 libk5crypto3 libkeyutils1 libkrb5-26-heimdal libkrb5-3
libkrb5support0 libldap-2.4-2 libldap-common libnghttp2-14 libperl5.26
libpsl5 libpthread-stubs0-dev libroken18-heimdal librtmp1 libsasl2-2
libsasl2-modules libsasl2-modules-db libsqlite3-0 libssl1.0.0 libssl1.1
libwind0-heimdal libx11-6 libx11-data libx11-dev libx11-doc libxau-dev
libxau6 libxcb1 libxcb1-dev libxdmcp-dev libxdmcp6 libxml2 libxmuu1
libxrender1 multiarch-support netbase openssh-client openssl patch perl
perl-base perl-modules-5.26 publicsuffix shared-mime-info x11-common
x11proto-core-dev x11proto-dev xauth xdg-user-dirs xorg-sgml-doctools
xtrans-dev
Suggested packages:
gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-el git-email
git-gui gitk gitweb git-cvs git-mediawiki git-svn gdbm-l10n krb5-doc
krb5-user libsasl2-modules-gssapi-mit | libsasl2-modules-gssapi-heimdal
libsasl2-modules-ldap libsasl2-modules-otp libsasl2-modules-sql libxcb-doc
keychain libpam-ssh monkeysphere ssh-askpass ed diffutils-doc perl-doc
libterm-readline-gnu-perl | libterm-readline-perl-perl make
The following NEW packages will be installed:
ca-certificates git git-man krb5-locales less libasn1-8-heimdal libbsd0
libcurl3-gnutls libedit2 liberror-perl libexpat1 libgdbm-compat4 libgdbm5
libglib2.0-0 libglib2.0-data libgssapi-krb5-2 libgssapi3-heimdal
libhcrypto4-heimdal libheimbase1-heimdal libheimntlm0-heimdal
libhx509-5-heimdal libice6 libicu60 libk5crypto3 libkeyutils1
libkrb5-26-heimdal libkrb5-3 libkrb5support0 libldap-2.4-2 libldap-common
libnghttp2-14 libperl5.26 libpsl5 libpthread-stubs0-dev libroken18-heimdal
librtmp1 libsasl2-2 libsasl2-modules libsasl2-modules-db libsm6 libsqlite3-0
libssl1.0.0 libssl1.1 libwind0-heimdal libx11-6 libx11-data libx11-dev
libx11-doc libxau-dev libxau6 libxcb1 libxcb1-dev libxdmcp-dev libxdmcp6
libxext6 libxml2 libxmuu1 libxrender-dev libxrender1 multiarch-support
netbase openssh-client openssl patch perl perl-modules-5.26 publicsuffix
shared-mime-info x11-common x11proto-core-dev x11proto-dev xauth
If you run this inside Docker and need so many packages, the answer is obvious: create and push a docker image to dockerhub.
Hi @iirekm
Some background before I address the issue.
trains-agent can operate in two modes: virtual environment mode, where it creates a dedicated venv for every task on the host, and docker mode, where it runs each task inside a docker container (and sets up the venv inside that container).
Answering a few of the comments you raised:
If you run this inside Docker and need so many packages, the answer is obvious: create and push a docker image to dockerhub.
Yes, definitely do that, it will save a lot of time. Plus, with the way trains-agent works, if you do need to change a specific package version or install a new one, you do not have to immediately build a new docker image, i.e. instead of a must, building a new docker image becomes a choice, which is always preferable :)
I think caching envs would be easy: just keep many envs in different directories, whose names could end with a hash of setup.py or requirements.txt.
Yes, this is exactly what we are planning to do; venv caching is in the making. As you obviously witnessed, installing a large set of packages with pip can take a very long time (actually, we PRed a draft for parallelizing the pip install process to accelerate it :) ). Anyhow, at least all the packages are cached, so the next time the only penalty is unzipping (i.e. extracting the wheels), which still might take too long, hence venv caching.
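On the package caching point, the agent's pip download cache location is configurable as well, something along these lines (key names and default path from memory, double-check your trains.conf):

agent {
    pip_download_cache {
        enabled: true,
        path: ~/.trains/pip-download-cache
    }
}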
BTW: the services queue will always run in docker mode, with a default ubuntu-18.04 docker image, hence the "apt packages" list; obviously this can be configured.
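As a sketch of how the default image can be changed (the image here is only an example, and the services agent may read it from its own configuration):

agent {
    default_docker: {
        # image used when the task itself does not specify a docker image
        image: "ubuntu:18.04"
        # extra docker run arguments, e.g. arguments: ["--ipc=host"]
    }
}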
Both venv and Docker mode still seem too slow with the current implementation. It is a lot of progress compared with e.g. SageMaker, where a new instance is created for each task, but there is still a huge area for improvement.
For now, my workaround for venv caching is:
import subprocess
import sys
from hashlib import sha256
from os import getenv, environ
from pathlib import Path

if getenv("TRAINS_WORKER_ID") is not None and getenv("__TRAINS_FAST_VIRTUALENV") is None:
    project_dir = Path(__file__).parent.parent.parent  # TODO: adjust to your project layout
    with open(project_dir / "setup.py", "rb") as f:
        setup_hash = sha256(f.read()).hexdigest()
    venv_dir = Path(f"{Path.home()}/.trains/__venv-cache/{(getenv('TRAINS_WORKER_ID') or '-').replace(':', '-')}/{setup_hash}")
    python_exe = f"{venv_dir}/bin/python"
    # reset env to deactivate the uncached trains venv
    fixed_env = {**environ, "PYTHONPATH": "", "__TRAINS_FAST_VIRTUALENV": "true", "PATH": "/usr/local/bin:/usr/bin:/bin"}
    if not venv_dir.is_dir():
        subprocess.run(["virtualenv", "--python=python3.7", str(venv_dir)], env=fixed_env, check=True)  # use whichever interpreter your project needs
        subprocess.run([python_exe, "-m", "pip", "install", "-e", "."], cwd=project_dir, env=fixed_env, check=True)
    # re-run this script inside the cached venv, then stop the current (uncached) interpreter
    subprocess.run([python_exe] + sys.argv, env=fixed_env, check=True)
    sys.exit()


def trains_pip_install():
    """
    Usage: at the top of your file, import and call this method.
    """
    pass  # nothing to do; just here so IDEs do not complain about an unused import and do not reorder imports
(reduces task startup time from ~2 minutes to about 20 seconds)
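To use it, assuming the snippet above lives in a module called fast_venv.py (the file name is just for illustration), the training script only needs:

from fast_venv import trains_pip_install  # importing the module runs the bootstrap block above

trains_pip_install()  # no-op; only keeps the import from being flagged as unused or reordered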
@iirekm that is awesome!!!
I was thinking of hashing these sections: the "requirements" section coming from here, together with task.execution.docker_cmd (because we need to match the venv with the docker image it is running in), then just restoring/copying the venv folder based on the hashed folder name.
WDYT?
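In code terms, what I have in mind is roughly this (purely an illustration; the function and field names are my shorthand, not the agent's actual code):

import hashlib
import json

def venv_cache_key(requirements, docker_cmd):
    # requirements: the normalized "requirements" dict stored on the task
    # docker_cmd:   task.execution.docker_cmd, so a venv is only reused inside the same image
    payload = json.dumps({"requirements": requirements, "docker_cmd": docker_cmd}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# a cached venv would then live under something like ~/.trains/venvs-cache/<key>
# and be copied/restored into the task's venv folder on a cache hit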
The code is too big to understand quickly :-)
For me, anything that is fast will do. Also note that people use different dependency mechanisms: requirements.txt, setup.py, or Anaconda's yamls. I guess all of them have to be supported, and the virtualenv reuse mechanism has to be aware of changes in those files.
They are all supported by the system, and all of them are in the end conformed to a "requirements" dict.
You mentioned "requirements.txt"; from the trains-agent setup perspective it is only used as a fallback when the "installed packages" section is empty.
Also notice that once the venv is ready, the agent updates the "installed packages" back to the full pip freeze of the newly created venv; this is the state I'm suggesting we cache.
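Conceptually, the caching step would then look something like this (just a sketch of the idea, not the actual agent implementation; names and paths are mine):

import hashlib
import shutil
import subprocess
from pathlib import Path

def cache_ready_venv(venv_dir, cache_root):
    # the cache key is the full "pip freeze" of the newly created venv,
    # i.e. the same state that gets written back to the task's "installed packages"
    python_exe = str(Path(venv_dir) / "bin" / "python")
    freeze = subprocess.run([python_exe, "-m", "pip", "freeze"],
                            capture_output=True, text=True, check=True).stdout
    key = hashlib.sha256(freeze.encode("utf-8")).hexdigest()
    target = Path(cache_root) / key
    if not target.exists():
        # keep a copy of the ready venv so the next task with the same package set can reuse it
        shutil.copytree(venv_dir, target, symlinks=True)
    return target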
Hi,
I'm updating here that the latest version of clearml-agent now includes venv caching capabilities.
Add this section to your ~/clearml.conf file on the agent's machine:
agent {
    # cached virtual environment folder
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space to allow for cache entry, disable by passing 0 or negative value
        free_space_threshold_gb: 2.0
        # unmark to enable virtual environment caching
        path: ~/.clearml/venvs-cache
    },
}
Reference here: https://github.com/allegroai/clearml-agent/blob/22d5892b12efa2acde304658ad0f08594b3e4ce6/docs/clearml.conf#L93
And upgrade and restart the clearml-agent:
pip install clearml-agent==0.17.2rc2