allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.69k stars 654 forks

How to prevent the agent from installing a cached requirements.txt #793

Open AkideLiu opened 2 years ago

AkideLiu commented 2 years ago

Thank you for helping us make ClearML better!

Describe the bug

To reproduce

The scenario is as follows: suppose we have two workers (A, B) in the same queue, running on different farms with different ClearML configurations. Additionally, storage is not shared between workers A and B.

If we create a task X and X runs properly on worker A, and we then clone X into a task Y that happens to run on worker B, worker B's agent tries to install the specific cached requirements.txt from worker A, causing the job to fail because that file does not exist on worker B.

clearml_agent: ERROR: Could not install task requirements!
Command '['/tmp/zbw/.clearml/venvs-builds.1/3.8/bin/python', '-m', 'pip', '--disable-pip-version-check', 'install', '-r', '/tmp/cached-reqsa_cqtase.txt', '--extra-index-url', 'https://download.pytorch.org/whl/cu113']' returned non-zero exit status 1.

Expected behaviour

How can the agent configuration be modified so that agents do not use a cached requirements.txt from previous jobs?

Environment

jkhenning commented 2 years ago

Hi @AkideLiu,

In the clearml agent, the main idea is for a task that was already executed to be exactly reproducible when cloned - that's why the agent attempts to install the exact same requirements. This is usually not a problem when you're running the agent in docker mode, since the cloned task will likely be executed using the same docker image and will therefore be able to install the exact same requirements. It only becomes a possible issue in venv mode, which is inherently less stable (i.e. it depends on the actual machine/OS/environment the agent is running on). You can always override this by clearing the requirements from the cloned task in the UI, in which case the agent will try to install requirements.txt from the git repository.
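As a side note, the agent configuration also exposes a flag that tells the agent to ignore the packages stored on the task and install from the repository's requirements.txt instead. A minimal clearml.conf fragment, assuming a recent clearml-agent (verify the exact key against your version's documentation):

```
agent {
    package_manager {
        # Assumption: ignore the task's stored "installed packages"
        # and install from the repository's requirements.txt instead
        force_repo_requirements_txt: true
    }
}
```

This applies to every task the agent runs, so it trades exact reproducibility for portability across workers.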

Implementing the behavior you are looking for is actually a new feature request (first try to install the "stored full requirements", then, if that fails, try to install the "original" requirements) - we will put it in our task list 🙂
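The requested fallback could be sketched like this. This is a hypothetical illustration of the logic, not the agent's actual implementation; the install step is injected as a callable (standing in for a `pip install -r <path>` wrapper) so the control flow is visible without running pip:

```python
def install_with_fallback(stored_reqs, original_reqs, install):
    """Install the stored ("frozen") requirements if possible,
    otherwise fall back to the repository's original requirements.

    install(path) is expected to raise an exception on failure,
    as a real `pip install -r path` wrapper would.
    Returns the requirements file that was successfully installed.
    """
    if stored_reqs is not None:
        try:
            install(stored_reqs)
            return stored_reqs
        except Exception:
            # Stored requirements missing or broken on this worker:
            # fall back to the repo's requirements.txt
            pass
    install(original_reqs)
    return original_reqs
```

With this shape, a cloned task landing on a worker that lacks the cached file would still come up, just without the exact-reproducibility guarantee of the frozen package list.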

AkideLiu commented 2 years ago


Hi @jkhenning, thanks for the clarification. Could you please let me know how to remove the cached requirements from the web UI?

Hopefully, you can consider implementing some kind of feature to retry a failed dependency installation, because for some use cases it may be hard or even impossible to leverage advanced containerization techniques. For example, a Slurm cluster might not natively support docker mode.

ainoam commented 2 months ago

@AkideLiu Please note that as of ClearML Server v1.14.0, package cache is available in the ClearML UI.