allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Reuse virtualenvs #242

Open iirekm opened 3 years ago

iirekm commented 3 years ago

Currently, Trains seems to recreate/reset the virtualenv before every task run, which means that packages are reinstalled again and again. On huge GPU instances this is a problem: every minute costs money, so the quicker my task starts, the better.

jkhenning commented 3 years ago

Hi @iirekm,

Reusing a virtualenv poses many challenges when it comes to reproducing an execution. An alternative is using the agent's package_manager.system_site_packages setting on a system where most requirements are preinstalled, or using a base docker image in which most requirements are installed. What do you think?
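
For reference, a rough sketch of what that setting looks like in the agent's configuration file (trains.conf / clearml.conf, depending on your version):

agent {
    package_manager: {
        # create new venvs with access to the system site-packages
        system_site_packages: true
    }
}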

iirekm commented 3 years ago

Are those package / docker image settings per agent, rather than per run? That will be a problem when I have (let's say) 2 agents and 3 projects; would I have to restart my agents with new settings when switching projects?

I think caching envs would be easy: just keep many envs in different directories whose names end with a hash of setup.py or requirements.txt.

iirekm commented 3 years ago

Plus, an even worse thing happens when I schedule a task to the default services queue:

ca-certificates git-man krb5-locales less libasn1-8-heimdal libbsd0
libcurl3-gnutls libedit2 liberror-perl libexpat1 libgdbm-compat4 libgdbm5
libglib2.0-data libgssapi-krb5-2 libgssapi3-heimdal libhcrypto4-heimdal
libheimbase1-heimdal libheimntlm0-heimdal libhx509-5-heimdal libice6
libicu60 libk5crypto3 libkeyutils1 libkrb5-26-heimdal libkrb5-3
libkrb5support0 libldap-2.4-2 libldap-common libnghttp2-14 libperl5.26
libpsl5 libpthread-stubs0-dev libroken18-heimdal librtmp1 libsasl2-2
libsasl2-modules libsasl2-modules-db libsqlite3-0 libssl1.0.0 libssl1.1
libwind0-heimdal libx11-6 libx11-data libx11-dev libx11-doc libxau-dev
libxau6 libxcb1 libxcb1-dev libxdmcp-dev libxdmcp6 libxml2 libxmuu1
libxrender1 multiarch-support netbase openssh-client openssl patch perl
perl-base perl-modules-5.26 publicsuffix shared-mime-info x11-common
x11proto-core-dev x11proto-dev xauth xdg-user-dirs xorg-sgml-doctools
xtrans-dev
Suggested packages:
gettext-base git-daemon-run | git-daemon-sysvinit git-doc git-el git-email
git-gui gitk gitweb git-cvs git-mediawiki git-svn gdbm-l10n krb5-doc
krb5-user libsasl2-modules-gssapi-mit | libsasl2-modules-gssapi-heimdal
libsasl2-modules-ldap libsasl2-modules-otp libsasl2-modules-sql libxcb-doc
keychain libpam-ssh monkeysphere ssh-askpass ed diffutils-doc perl-doc
libterm-readline-gnu-perl | libterm-readline-perl-perl make
The following NEW packages will be installed:
ca-certificates git git-man krb5-locales less libasn1-8-heimdal libbsd0
libcurl3-gnutls libedit2 liberror-perl libexpat1 libgdbm-compat4 libgdbm5
libglib2.0-0 libglib2.0-data libgssapi-krb5-2 libgssapi3-heimdal
libhcrypto4-heimdal libheimbase1-heimdal libheimntlm0-heimdal
libhx509-5-heimdal libice6 libicu60 libk5crypto3 libkeyutils1
libkrb5-26-heimdal libkrb5-3 libkrb5support0 libldap-2.4-2 libldap-common
libnghttp2-14 libperl5.26 libpsl5 libpthread-stubs0-dev libroken18-heimdal
librtmp1 libsasl2-2 libsasl2-modules libsasl2-modules-db libsm6 libsqlite3-0
libssl1.0.0 libssl1.1 libwind0-heimdal libx11-6 libx11-data libx11-dev
libx11-doc libxau-dev libxau6 libxcb1 libxcb1-dev libxdmcp-dev libxdmcp6
libxext6 libxml2 libxmuu1 libxrender-dev libxrender1 multiarch-support
netbase openssh-client openssl patch perl perl-modules-5.26 publicsuffix
shared-mime-info x11-common x11proto-core-dev x11proto-dev xauth

If you run this inside Docker and need so many packages, the answer is obvious: create and push a docker image to Docker Hub.
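
For example (the image name is a placeholder), with a small Dockerfile based on ubuntu:18.04 that apt-installs git, openssh-client and the rest of the packages above, you build and push it once:

docker build -t myorg/trains-base:latest .
docker push myorg/trains-base:latest

and then point the task's base docker image at it.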

bmartinn commented 3 years ago

Hi @iirekm

Some background before I address the issue. trains-agent can operate in two modes:

  1. venv mode, where for every job a new venv is created and all the packages are installed into it (it can also be configured to inherit the system site-packages)
  2. Docker mode, where a docker container is spun up (based on the Task's "base docker image" section or a default docker image), then inside the container a new venv is created, inheriting all of the docker's system packages, and then the "installed packages" section is installed inside the docker (see the example commands after this list)
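
For illustration (queue name and docker image are placeholders), the two modes roughly correspond to how the agent daemon is launched:

# venv mode
trains-agent daemon --queue default

# docker mode (the image can also be overridden per Task via its base docker image field)
trains-agent daemon --queue default --docker nvidia/cuda:10.1-runtime-ubuntu18.04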

Answering a few of the comments you raised:

If you run this inside Docker and need so many packages, the answer is obvious: create and push a docker image to Docker Hub.

Yes definitely do that, it will save a lot of time. Plus, with the way trains-agent works, if you do need to change a specific package version or install a new one, you do not have to immediately build a new docker image, i.e. instead of a must, building a new docker becomes a choice, which is always preferable :)

I think caching envs would be easy, just keep many envs in different directories, names of which could end with hash of setup.py or requirements.txt

Yes, this is exactly what we are planning to do; venv caching is in the making. As you obviously witnessed, installing a large set of packages with pip can take a very long time (we actually opened a draft PR for parallelizing the pip install process to accelerate it :) ). Anyhow, at least all the packages are cached, so the next time the only penalty is unzipping (i.e. extracting the wheels), which still might take too long, hence venv caching.

BTW: the services queue will always run in docker mode, with a default ubuntu-18.04 docker image, hence the "apt packages" list; obviously this can be configured.
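
For example, the default image can be changed in the agent section of the configuration file (the value here is just a sketch):

agent {
    default_docker: {
        # image used when the Task does not specify a base docker image
        image: "ubuntu:18.04"
    }
}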

iirekm commented 3 years ago

Both venv and Docker mode with the current implementation still seem too slow. It's a lot of progress compared with e.g. SageMaker, where a new instance is created for each task, but there is still huge room for improvement.

For now, my workaround for venv caching is:

import subprocess
import sys
from hashlib import sha256
from os import getenv, environ
from pathlib import Path

if getenv("TRAINS_WORKER_ID") is not None and getenv("__TRAINS_FAST_VIRTUALENV") is None:
    project_dir = Path(__file__).parent.parent.parent  # TODO
    with open("setup.py", "rb") as f:
        hash = sha256(f.read()).hexdigest()
    venv_dir = Path(f"{Path.home()}/.trains/__venv-cache/{(getenv('TRAINS_WORKER_ID') or '-').replace(':', '-')}/{hash}")
    python_exe = f"{venv_dir}/bin/python"
    fixed_env = env = {**environ, "PYTHONPATH": "", "__TRAINS_FAST_VIRTUALENV": "true", "PATH": "/usr/local/bin:/usr/bin:/bin"}
    if not venv_dir.is_dir():
        # reset env to deactivate the uncached trains venv
        subprocess.run(["virtualenv", "--python=py37", venv_dir], env=fixed_env, check=True)
        subprocess.run([python_exe, "-m", "pip", "install", "-e", "."], cwd=project_dir, env=fixed_env, check=True)
    subprocess.run([python_exe] + sys.argv, env=fixed_env, check=True)
    exit()

def trains_pip_install():
    """
    Usage: at to of your file import and call this method.
    """
    pass  # nothing to do; just for IntelliJ to not complain about unused import and to avoid import reordering

(reduces task startup time from ~2 minutes to about 20 seconds)
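
Usage is then just making it the first import in the task's entry script (assuming the snippet above is saved as, say, trains_fast_venv.py; the module name is arbitrary):

# first import in the entry script; the module-level code above re-executes the
# script inside the cached venv when running under a trains worker
from trains_fast_venv import trains_pip_install

trains_pip_install()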

bmartinn commented 3 years ago

@iirekm that is awesome!!! I was thinking of hashing these sections: the requirements coming from here, together with task.execution.docker_cmd (because we need to match the venv with the docker image it is running in), then just restore/copy the venv folder based on that hash. WDYT?
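
Something along these lines (just a sketch of the hashing idea, not actual agent code):

import hashlib
import json

def venv_cache_key(requirements: dict, docker_cmd: str) -> str:
    # hash the resolved requirements together with the docker command/image,
    # so a cached venv is only reused inside the docker it was created for
    payload = json.dumps({"requirements": requirements, "docker_cmd": docker_cmd}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()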

iirekm commented 3 years ago

the code is too big to understand quickly :-)

For me anything that is fast will do; also note that people use different dependency mechanisms: requirements.txt, setup.py or Anaconda's YAML files. I guess all of them have to be supported, and the virtualenv reuse mechanism has to be aware of changes in those files.

bmartinn commented 3 years ago

They are all supported by the system, and all of them are in the end conformed to a "requirements" dict. You mentioned "requirements.txt"; from the trains-agent setup perspective this is a fallback used when the "installed packages" section is empty. Also notice that once the venv is ready, the agent updates the "installed packages" section back to the full pip freeze of the newly created venv; this is the state I'm suggesting we cache.

bmartinn commented 3 years ago

Hi, I'm updating here that the latest version of clearml-agent now includes venv caching capabilities:

Add this section to your ~/clearml.conf file on the agent's machine:

agent {
    # cached virtual environment folder
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space to allow for cache entry, disable by passing 0 or negative value
        free_space_threshold_gb: 2.0
        # unmark to enable virtual environment caching
        path: ~/.clearml/venvs-cache
    }, 
}

Reference here: https://github.com/allegroai/clearml-agent/blob/22d5892b12efa2acde304658ad0f08594b3e4ce6/docs/clearml.conf#L93

Then upgrade and restart the clearml-agent:

pip install clearml-agent==0.17.2rc2
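
For example, restarting a simple agent listening on the default queue (the queue name is whatever you use):

clearml-agent daemon --queue default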