allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

ClearML does not find all packages #198

Closed: terbed closed this issue 7 months ago

terbed commented 7 months ago

Hello,

I cannot reproduce experiments remotely because the environment is not constructed correctly. These are the packages ClearML recognized:

# Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]

clearml == 1.15.1
kiwisolver == 1.4.5
lightning == 2.2.0.post0
torch == 2.2.0+cu118

Actual packages in the environment:

numpy==1.26.3
PyYAML==6.0.1
torch==2.2.0+cu118
torchmetrics==1.3.1
torchvision==0.17.0+cu118
tqdm==4.66.2
lightning==2.2.0.post0
lightning[pytorch-extra]
matplotlib
pandas

So the remotely reproduced training fails because torchvision is not installed in the env:


- nvidia-nccl-cu12==2.19.3
- nvidia-nvjitlink-cu12==12.4.127
- nvidia-nvtx-cu12==12.1.105
- orderedmultidict==1.0.1
- packaging==24.0
- pathlib2==2.3.7.post1
- pillow==10.3.0
- platformdirs==4.2.0
- psutil==5.9.8
- PyJWT==2.8.0
- pyparsing==3.1.2
- python-dateutil==2.8.2
- pytorch-lightning==2.2.1
- PyYAML==6.0.1
- referencing==0.34.0
- requests==2.31.0
- rpds-py==0.18.0
- six==1.16.0
- sympy==1.12
- torch==2.2.0+cu121
- torchmetrics==1.3.2
- tqdm==4.66.2
- triton==2.2.0
- typing_extensions==4.11.0
- urllib3==1.26.18
- virtualenv==20.25.1
- yarl==1.9.4
Environment setup completed successfully
Starting Task Execution:
2024-04-11 22:42:01
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/PhaseReconstruction.git/main.py", line 2, in <module>
    from src.data import PRDataModule
  File "/root/.clearml/venvs-builds/3.10/task_repository/PhaseReconstruction.git/src/data.py", line 3, in <module>
    import torchvision.transforms.functional as tvf
ModuleNotFoundError: No module named 'torchvision'
2024-04-11 22:42:01
Process failed, exit code 1
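
To see which packages the detection missed, the two lists above can be compared by normalized name. This is only a minimal sketch using the standard library; the helper `package_names` is a hypothetical name introduced here, and the parsing is deliberately simple (it assumes one requirement per line and ignores everything after the package name).

```python
import re


def package_names(lines):
    """Return the set of normalized package names found in requirement lines."""
    names = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comments such as the "# Python 3.11.8 ..." header.
        if not line or line.startswith("#"):
            continue
        # Keep only the name: cut at the first extras marker, specifier, or space.
        name = re.split(r"[\[=<>!~ ]", line, maxsplit=1)[0]
        # Normalize as in PEP 503 so that "PyYAML" and "pyyaml" compare equal.
        names.add(re.sub(r"[-_.]+", "-", name).lower())
    return names


# Packages recognized for the task (copied from the report above).
recognized = """
# Python 3.11.8 (main, Feb 12 2024, 14:50:05) [GCC 13.2.1 20230801]
clearml == 1.15.1
kiwisolver == 1.4.5
lightning == 2.2.0.post0
torch == 2.2.0+cu118
""".splitlines()

# Packages actually required by the environment (copied from the report above).
actual = """
numpy==1.26.3
PyYAML==6.0.1
torch==2.2.0+cu118
torchmetrics==1.3.1
torchvision==0.17.0+cu118
tqdm==4.66.2
lightning==2.2.0.post0
lightning[pytorch-extra]
matplotlib
pandas
""".splitlines()

# Names that are required but were not recognized.
missing = sorted(package_names(actual) - package_names(recognized))
print(missing)
```

With the lists from this report, `torchvision` is among the names that were not recognized, which matches the `ModuleNotFoundError` in the log.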

jkhenning commented 7 months ago

Hi @terbed, Can you share the full execution log? Which agent version did you use when running the experiment remotely?

terbed commented 7 months ago

Hi @jkhenning,

CLEARML-AGENT version: 1.8.0
Self-hosted server: WebApp: 1.15.0-472 • Server: 1.15.0-472 • API: 2.29

Unfortunately, the only other log I have is from an attempt to set up Python 3.11, which did not work out well: task_a6e71e6fdd704d68a45b4f5ec6eb6ad8.log

I've lost the log from the issue described above, because I worked around it by adding this to the config:

    # optional shell script to run in docker when started before the experiment is started
    extra_docker_shell_script: ["pip install torchvision", "pip install lightning[pytorch-extra]"]

which is far from ideal.
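
A possible alternative to the shell-script workaround is to declare the missing package from the training script itself, so that it is recorded with the task. The sketch below assumes the ClearML SDK's `Task.add_requirements`, which has to be called before `Task.init`; the function name `init_task_with_requirements` and the project and task names in the usage comment are hypothetical, and this is not necessarily the answer given in the linked issue.

```python
# Packages that the automatic detection missed in this report.
EXTRA_PACKAGES = ["torchvision"]


def init_task_with_requirements(project_name, task_name, extra_packages=EXTRA_PACKAGES):
    """Register extra requirements, then create the ClearML task.

    The import is done lazily so this module can be loaded on machines
    where clearml is not installed.
    """
    from clearml import Task

    for package in extra_packages:
        # Must run before Task.init; with no version given, the SDK is
        # expected to record the locally installed version if it finds one.
        Task.add_requirements(package)

    return Task.init(project_name=project_name, task_name=task_name)


# Hypothetical usage at the top of main.py:
# task = init_task_with_requirements("PhaseReconstruction", "remote training")
```

Compared with `extra_docker_shell_script`, this keeps the requirement with the experiment rather than in the agent configuration, so any agent that picks up the task should install it.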

terbed commented 7 months ago

Some updates on the issue:

The full log: task_9f8e874ec6eb49ada468f18be0059d3f.log

terbed commented 7 months ago

I got the answer in the related issue: https://github.com/allegroai/clearml/issues/1245#issuecomment-2056962265