stephanbertl opened this issue 1 year ago
Hi @stephanbertl! Looks like the agent gets stuck after pulling a package from your cache. Can you clear your caches? Also, is the agent able to reach https://nexus.local/repository/pypi/simple ?
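For reference, a minimal sketch (not from the thread itself) of how that reachability can be checked from the machine or container running the agent, using the index URL above; `cython` is just an arbitrary probe package:

```bash
# HTTP-level check of the PyPI proxy index
curl -sSI https://nexus.local/repository/pypi/simple/ | head -n 1

# pip-level check against the same index; downloads a single wheel, no deps
pip download cython --no-deps --dest /tmp/pip-probe \
    --index-url https://nexus.local/repository/pypi/simple
```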
The agent can reach the Nexus repository just fine. It can download a few dependencies but then gets stuck. Caches are clear. I think something odd is going on; we use our on-premise Nexus repo daily and have never seen such issues.
Same problem when using conda mode inside a Docker container.
It's indeed very strange. It must be something in the combination of Nexus Repository and ClearML: we run several other Python workflows and they all download packages successfully.
Executing Conda: /opt/conda/bin/conda install -p /root/.clearml/venvs-builds/3.10 -c https://artifacts.local/repository/anaconda-proxy/main 'pip<20.2 ; python_version < '"'"'3.10'"'"'' 'pip<22.3 ; python_version >= '"'"'3.10'"'"'' --quiet --json
Conda error: CondaValueError: invalid package specification: pip<20.2 ; python_version < '3.10
Conda: Trying to install requirements:
['cudatoolkit=10.2']
Executing Conda: /opt/conda/bin/conda env update -p /root/.clearml/venvs-builds/3.10 --file /tmp/conda_envfps563o1.yml --quiet --json
2023-05-15 10:19:28
Pass
Conda: Installing requirements: step 2 - using pip:
['absl-py==1.4.0', 'attrs==23.1.0', 'cachetools==5.3.0', 'certifi==2023.5.7', 'charset-normalizer==3.1.0', 'clearml==1.10.4', 'Cython==0.29.34', 'furl==2.1.3', 'google-auth==2.18.0', 'google-auth-oauthlib==0.4.6', 'grpcio==1.54.0', 'idna==3.4', 'jsonschema==4.17.3', 'Markdown==3.4.3', 'MarkupSafe==2.1.2', 'numpy==1.24.3', 'nvidia-cublas-cu11==11.10.3.66', 'nvidia-cuda-nvrtc-cu11==11.7.99', 'nvidia-cuda-runtime-cu11==11.7.99', 'nvidia-cudnn-cu11==8.5.0.96', 'oauthlib==3.2.2', 'orderedmultidict==1.0.1', 'pathlib2==2.3.7.post1', 'Pillow==9.5.0', 'protobuf==3.20.3', 'psutil==5.9.5', 'pyasn1==0.5.0', 'pyasn1-modules==0.3.0', 'PyJWT==2.4.0', 'pyparsing==3.0.9', 'pyrsistent==0.19.3', 'python-dateutil==2.8.2', 'PyYAML==6.0', 'requests==2.30.0', 'requests-oauthlib==1.3.1', 'rsa==4.9', 'six==1.16.0', 'tensorboard==2.11.0', 'tensorboard-data-server==0.6.1', 'tensorboard-plugin-wit==1.8.1', 'torch==2.0.1', 'torchvision==2.0.1', 'torchaudio==2.0.1', 'typing_extensions==4.5.0', 'urllib3==1.26.15', 'Werkzeug==2.3.4']
Looking in indexes: https://artifacts.local/repository/pypi/simple
Collecting Cython==0.29.34
Downloading https://artifacts.local/repository/pypi/packages/cython/0.29.34/Cython-0.29.34-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 84.5 MB/s eta 0:00:00
Installing collected packages: Cython
Successfully installed Cython-0.29.34
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://artifacts.local/repository/pypi/simple
Collecting numpy==1.24.3
Downloading https://artifacts.local/repository/pypi/packages/numpy/1.24.3/numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 64.5 MB/s eta 0:00:00
Installing collected packages: numpy
2023-05-15 10:19:33
Successfully installed numpy-1.24.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
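Regarding the CondaValueError above: conda's package-spec parser does not accept pip-style (PEP 508) environment markers, so a spec carrying `; python_version ...` is rejected, while the bare version constraint is fine (the log shows the agent continues regardless). A minimal sketch reproducing this outside the agent, reusing the prefix from the log and `--dry-run` so nothing is installed:

```bash
# Rejected: conda does not parse PEP 508 environment markers
conda install -p /root/.clearml/venvs-builds/3.10 --dry-run \
    'pip<22.3 ; python_version >= "3.10"'
# -> CondaValueError: invalid package specification

# Accepted: the same constraint without the marker is a valid conda spec
conda install -p /root/.clearml/venvs-builds/3.10 --dry-run 'pip<22.3'
```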
Hi @stephanbertl, we actually use Nexus too, internally and externally, and do not experience any issues, so I don't think this is specific to Nexus plus ClearML; it is more likely something related to your particular setup.
Actually, from the log this looks more like an agent connectivity issue with the ClearML server than something related to the installation process (correct me if I'm wrong, but it seems the installation succeeds and then it hangs, right?).
I would suggest running the agent with the --debug flag to try and get some more visibility.
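For example, combined with the daemon invocation from the issue description (exact flag placement may differ between agent versions):

```bash
# Top-level --debug enables verbose agent logging
clearml-agent --debug daemon --queue GPU --foreground
```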
Thanks for the hint, we found some more interesting logs now.
The agent tries to connect to download.pytorch.org, even though pip is configured to use only the Nexus proxy. I understand that ClearML rewrites the torch requirement to match the CUDA environment, right?
We cannot proxy the download.pytorch.org repository with Nexus because it is not PEP 503 compliant. See https://github.com/pytorch/pytorch/issues/25639
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:53:59
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:54:40
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:55:20
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:56:00
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:56:40
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:57:20
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
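For context on the PEP 503 point, a hedged way to look at what the upstream index actually returns; the URL layout is the one pip itself is pointed at for the cu117 wheels, and whether a Nexus PyPI proxy can consume it is exactly what the linked issue discusses:

```bash
# Inspect the per-package listing pip sees when using
# --index-url https://download.pytorch.org/whl/cu117
curl -s https://download.pytorch.org/whl/cu117/torch/ | head -n 20
```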
A small update: we proxied download.pytorch.org. However, clearml-agent still cannot resolve the torch wheel through it:
Package(s) not found: torch
clearml_agent: Warning: could not resolve python wheel replacement for torch==2.0.1
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pytorch.py", line 517, in replace
new_req = self._replace(req)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pytorch.py", line 559, in _replace
six.raise_from(exc, None)
File "<string>", line 3, in raise_from
clearml_agent.helper.package.pytorch.PytorchResolutionError: Could not find pytorch wheel URL for: torch==2.0.1 with cuda 117 support
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 3020, in install_requirements_for_package_api
package_api.load_requirements(cached_requirements)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pip_api/venv.py", line 40, in load_requirements
requirements["pip"] = self.requirements_manager.replace(requirements["pip"])
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 630, in replace
new_requirements = tuple(replace_one(i, req) for i, req in enumerate(parsed_requirements))
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 630, in <genexpr>
new_requirements = tuple(replace_one(i, req) for i, req in enumerate(parsed_requirements))
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 621, in replace_one
return self._replace_one(req)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 607, in _replace_one
return handler.replace(req)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pytorch.py", line 524, in replace
raise PytorchResolutionError("{}: {}".format(message, e))
clearml_agent.helper.package.pytorch.PytorchResolutionError: Exception when trying to resolve python wheel: Could not find pytorch wheel URL for: torch==2.0.1 with cuda 117 support
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/__main__.py", line 87, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/__main__.py", line 83, in main
return run_command(parser, args, command_name)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/__main__.py", line 46, in run_command
return func(**args_dict)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/base.py", line 63, in newfunc
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 2495, in execute
self.install_requirements(
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 2971, in install_requirements
return self.install_requirements_for_package_api(execution, repo_info, requirements_manager,
File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 3024, in install_requirements_for_package_api
raise ValueError("Could not install task requirements!\n{}".format(e))
ValueError: Could not install task requirements!
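Given the resolver error above, one hedged sanity check is whether the proxied index actually exposes the cu117 wheel the agent is asking for. The repository path below is hypothetical (substitute the real Nexus proxy repo), and the wheel filename follows the upstream cu117 naming scheme:

```bash
# %2B is the URL-encoded "+" in the local version tag (2.0.1+cu117);
# "pytorch-proxy" is a placeholder repository name
curl -sI "https://artifacts.local/repository/pytorch-proxy/whl/cu117/torch-2.0.1%2Bcu117-cp310-cp310-linux_x86_64.whl" | head -n 1
```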
Can someone have a look and tell us what is going on? Installing torch from inside the docker container works just fine. The image used is based on nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04:
pip install torch==2.0.1+cu117 --index-url https://download.pytorch.org/whl/cu117
What is ClearML trying to do, and why can't it just use the agent.package_manager.extra_index_url parameter we provided? What kind of lookup does it perform to find the PyTorch version? Nothing shows up in the debug log.
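For reference, the manual pip equivalent of what agent.package_manager.extra_index_url is expected to produce; the second index URL below is a hypothetical Nexus proxy of the cu117 wheels, not a confirmed working value:

```bash
# What the agent's pip step should roughly boil down to with an extra index configured
pip install torch==2.0.1+cu117 \
    --index-url https://artifacts.local/repository/pypi/simple \
    --extra-index-url https://artifacts.local/repository/pytorch-proxy/whl/cu117
```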
Non-docker mode works fine and downloads torch. Docker mode does not succeed in looking up PyTorch. Also, --debug does not seem to have any effect on the Python logs in docker mode.
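For completeness, a hedged sketch of the docker-mode invocation with debug logging, using the base image mentioned above; whether --debug propagates into the task container may depend on the agent version, which could explain the missing logs:

```bash
# Docker-mode daemon with verbose logging and the CUDA 11.7 base image
clearml-agent --debug daemon --queue GPU --foreground \
    --docker nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04
```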
We are trying to enqueue a ClearML task that uses torch. The agent is started with:
clearml-agent daemon --queue GPU --foreground
Environment
Any task referencing torch gets stuck without any error, hint, or timeout. Tasks using TensorFlow work fine.
Activating the virtualenv created by ClearML and running torch manually works.
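For reference, a sketch of that manual check, with the environment prefix taken from the conda log above:

```bash
# If the environment is a plain virtualenv:
source /root/.clearml/venvs-builds/3.10/bin/activate
# If it was created by conda (as in the log above), use instead:
#   conda activate /root/.clearml/venvs-builds/3.10
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```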
Output from the agent is shown in the log excerpts above.