allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.42k stars 643 forks

ClearML does not find all packages #1245

Closed terbed closed 2 months ago

terbed commented 2 months ago

Describe the bug

I cannot reproduce experiments remotely, because the remote environment is built from an incomplete package list. The packages ClearML recognized:

# Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]

clearml == 1.15.1
kiwisolver == 1.4.5
lightning == 2.2.0.post0
torch == 2.2.0+cu118

Actual packages in the environment:

numpy==1.26.3
PyYAML==6.0.1
torch==2.2.0+cu118
torchmetrics==1.3.1
torchvision==0.17.0+cu118
tqdm==4.66.2
lightning==2.2.0.post0
lightning[pytorch-extra]
matplotlib
pandas

So the remotely reproduced training fails because torchvision is not installed in the env:


- nvidia-nccl-cu12==2.19.3
- nvidia-nvjitlink-cu12==12.4.127
- nvidia-nvtx-cu12==12.1.105
- orderedmultidict==1.0.1
- packaging==24.0
- pathlib2==2.3.7.post1
- pillow==10.3.0
- platformdirs==4.2.0
- psutil==5.9.8
- PyJWT==2.8.0
- pyparsing==3.1.2
- python-dateutil==2.8.2
- pytorch-lightning==2.2.1
- PyYAML==6.0.1
- referencing==0.34.0
- requests==2.31.0
- rpds-py==0.18.0
- six==1.16.0
- sympy==1.12
- torch==2.2.0+cu121
- torchmetrics==1.3.2
- tqdm==4.66.2
- triton==2.2.0
- typing_extensions==4.11.0
- urllib3==1.26.18
- virtualenv==20.25.1
- yarl==1.9.4
Environment setup completed successfully
Starting Task Execution:
2024-04-11 22:42:01
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/PhaseReconstruction.git/main.py", line 2, in <module>
    from src.data import PRDataModule
  File "/root/.clearml/venvs-builds/3.10/task_repository/PhaseReconstruction.git/src/data.py", line 3, in <module>
    import torchvision.transforms.functional as tvf
ModuleNotFoundError: No module named 'torchvision'
2024-04-11 22:42:01
Process failed, exit code 1

Environment

wxdrizzle commented 2 months ago

I have a similar issue, #1198, which has also not been resolved yet. In my case I found a very strange workaround: adding the line import tmp to the main.py that gets executed, where tmp.py is an empty file (no code inside it) in the same folder.

terbed commented 2 months ago

Hi @wxdrizzle, thanks for linking your similar issue.

Some updates:

eugen-ajechiloae-clearml commented 2 months ago

Hi @terbed @wxdrizzle! We are using a pigar fork to auto-fetch the requirements. Only top-level imports will be fetched (see the FAQ): https://github.com/damnever/pigar#faq. Also, note that only packages will be inspected, so an __init__.py is mandatory in any local folder whose files you want inspected.
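
To illustrate why the missing __init__.py matters, here is a minimal sketch of import scanning that only descends into folders that are packages. It is a simplified model of the behaviour described above, not pigar's or ClearML's actual code; the function names and the contents written to main.py are assumptions made for the demo.

```python
import ast
import os
import tempfile


def imported_modules(source):
    """Return the top-level module names imported by a piece of source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names.add(node.module.split(".")[0])
    return names


def scan_entry_script(entry_script):
    """Scan the entry script plus sibling folders that contain an __init__.py."""
    root = os.path.dirname(os.path.abspath(entry_script))
    files = [entry_script]
    for name in sorted(os.listdir(root)):
        sub = os.path.join(root, name)
        # Assumption: a folder without __init__.py is not a package and is skipped.
        if os.path.isdir(sub) and os.path.exists(os.path.join(sub, "__init__.py")):
            files += [os.path.join(sub, f) for f in sorted(os.listdir(sub))
                      if f.endswith(".py")]
    found = set()
    for path in files:
        with open(path) as fh:
            found |= imported_modules(fh.read())
    return found


def demo(with_init):
    """Build a tiny project shaped like the one in the traceback and scan it."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "src"))
        with open(os.path.join(root, "main.py"), "w") as fh:
            # Hypothetical file contents, for illustration only.
            fh.write("import lightning\nfrom src.data import PRDataModule\n")
        with open(os.path.join(root, "src", "data.py"), "w") as fh:
            fh.write("import torchvision.transforms.functional as tvf\n")
        if with_init:
            open(os.path.join(root, "src", "__init__.py"), "w").close()
        return scan_entry_script(os.path.join(root, "main.py"))


print(sorted(demo(with_init=False)))  # ['lightning', 'src']
print(sorted(demo(with_init=True)))   # ['lightning', 'src', 'torchvision']
```

In this model torchvision only shows up once src/ contains an __init__.py, which matches the failure in the log above. A real scanner would also drop local names such as src from the result.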

There are a few ways to specify additional packages or to use a different auto-detection mechanism:

  1. Use one of these functions to explicitly mention packages:
  2. Use Task.force_requirements_env_freeze (https://clear.ml/docs/latest/docs/references/sdk/task#taskforce_requirements_env_freeze) to detect packages with pip freeze or conda list instead. Setting sdk.development.detect_with_pip_freeze or sdk.development.detect_with_conda_freeze to true in clearml.conf achieves the same thing.
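
As a rough sketch of both options in code: the helper below either adds packages explicitly with Task.add_requirements or switches to freeze-based detection with Task.force_requirements_env_freeze. Both must be called before Task.init(). The helper name declare_requirements and the package list are assumptions for illustration; the Task class is passed in as an argument only so that the snippet can be imported without clearml installed.

```python
def declare_requirements(task_cls, packages=(), freeze=False):
    """Register requirements on the ClearML Task class before Task.init().

    task_cls is expected to be clearml.Task (hypothetical helper, not part
    of the ClearML SDK).
    """
    if freeze:
        # Option 2: capture the whole environment via pip freeze / conda list.
        task_cls.force_requirements_env_freeze(force=True)
        return
    # Option 1: explicitly add packages that auto-detection missed.
    # With no version given, the installed version is used if it can be found.
    for name in packages:
        task_cls.add_requirements(name)


def main():
    from clearml import Task  # imported here so the helper stays importable

    declare_requirements(Task, packages=["torchvision", "numpy"])
    # Project and task names below are placeholders.
    Task.init(project_name="PhaseReconstruction", task_name="train")
```

Calling main() from the entry script, before any other ClearML call, should make torchvision part of the recorded requirements; passing freeze=True instead records everything installed in the local environment.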

terbed commented 2 months ago

Hi @eugen-ajechiloae-clearml, thank you for the information, this explains everything! :)

Best wishes, Daniel

wxdrizzle commented 2 months ago

Thank you very much for the detailed explanation!