allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

clearml agent stuck when using torch #151

Open stephanbertl opened 1 year ago

stephanbertl commented 1 year ago

We are trying to enqueue a ClearML task that uses torch. The agent is started with:

clearml-agent daemon --queue GPU --foreground

Environment

Any task referencing torch gets stuck without any error, hint, or timeout. Tasks using TensorFlow work fine.

Manually activating the virtualenv created by ClearML and running torch there works.
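
For reference, the task is essentially a minimal torch script (the log below shows entry_point = gpu_test.py). A simplified sketch of that kind of task, with placeholder project/task names (the actual script is not included here):

# gpu_test.py -- illustrative sketch only, not the original script
from clearml import Task
import torch

# Register the run with the ClearML server so it can be enqueued to the GPU queue.
# torch appears in the task's recorded requirements, which is what the agent later resolves.
task = Task.init(project_name="debug", task_name="gpu_test")

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())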

Output from the agent:

Current configuration (clearml_agent v1.5.2, location: /tmp/.clearml_agent.xsyv012z.cfg)
----------------------
api.version = 1.5
api.verify_certificate = false
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.web_server = http://app.clearml.test.info
api.api_server = http://api.clearml.test.info
api.files_server = http://files.clearml.test.info
api.credentials.access_key = GB5QSM2ASB6MUZ0P2NTP
api.host = http://api.clearml.test.info
agent.worker_id = sap201dd:0
agent.worker_name = sap201dd
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <22.3 ; python_version >\= '3.10'
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.package_manager.extra_index_url.0 = https://nexus.local/repository/pypi/simple
agent.venvs_dir = /home/adminuser/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/adminuser/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/adminuser/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/adminuser/.clearml/pip-cache
agent.docker_apt_cache = /home/adminuser/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.cuda_version = 118
agent.cudnn_version = 86
agent.default_python = 3.10
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri = http://files.clearml.test.info
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Executing task id [f0ffdf88fb984a169f29d6ceb5710832]:
repository =
branch =
version_num =
tag =
docker_cmd =
entry_point = gpu_test.py
working_dir = .

Python executable with version '3.7' requested by the Task, not found in path, using '/usr/bin/python3' (v3.10.6) instead
created virtual environment CPython3.10.6.final.0-64 in 131ms
  creator CPython3Posix(dest=/home/adminuser/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/adminuser/.local/share/virtualenv)
    added seed packages: pip==23.0.1, setuptools==67.4.0, wheel==0.38.4
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

Looking in indexes: https://nexus.local/repository/pypi/simple
Collecting pip<22.3
  Using cached https://nexus.local/repository/pypi/packages/pip/22.2.2/pip-22.2.2-py3-none-any.whl (2.0 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.0.1
    Uninstalling pip-23.0.1:
      Successfully uninstalled pip-23.0.1
Successfully installed pip-22.2.2
Looking in indexes: https://nexus.local/repository/pypi/simple
Collecting Cython
  Using cached https://nexus.local/repository/pypi/packages/cython/0.29.34/Cython-0.29.34-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
eugen-ajechiloae-clearml commented 1 year ago

Hi @stephanbertl! It looks like the agent gets stuck after pulling a package from your cache. Can you clear your caches? Also, is the agent able to reach https://nexus.local/repository/pypi/simple?
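
For reference, the cache directories involved are the ones listed in your configuration dump above. A minimal sketch for wiping them (paths taken from that dump; adjust to your own setup):

# clear_caches.py -- sketch; directory list taken from the agent configuration above
import shutil
from pathlib import Path

cache_dirs = [
    "~/.clearml/venvs-cache",         # agent.venvs_cache.path
    "~/.clearml/venvs-builds",        # agent.venvs_dir
    "~/.clearml/pip-download-cache",  # agent.pip_download_cache.path
    "~/.clearml/vcs-cache",           # agent.vcs_cache.path
]

for d in cache_dirs:
    path = Path(d).expanduser()
    if path.exists():
        shutil.rmtree(path)
        print("removed", path)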

stephanbertl commented 1 year ago

The agent can reach the Nexus repository just fine. It can download a few dependencies but then gets stuck. The caches are cleared. Something odd is going on: we use our on-premise Nexus repository daily and have never seen such issues.

stephanbertl commented 1 year ago

The same problem occurs when using conda mode inside a Docker container (package manager switched to conda, see the sketch below).

It's indeed very strange. It must be some combination of Nexus Repository and ClearML: we are running several other Python workflows and they all download packages successfully.
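
For clarity, "conda mode" here means switching the agent's package manager from pip to conda in clearml.conf, roughly as below (the exact configuration used for this run is not shown):

agent {
    package_manager {
        # switch the agent from pip to conda mode
        type: conda
        conda_channels: ["pytorch", "conda-forge", "defaults"]
    }
}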

Executing Conda: /opt/conda/bin/conda install -p /root/.clearml/venvs-builds/3.10 -c https://artifacts.local/repository/anaconda-proxy/main 'pip<20.2 ; python_version < '"'"'3.10'"'"'' 'pip<22.3 ; python_version >= '"'"'3.10'"'"'' --quiet --json
Conda error: CondaValueError: invalid package specification: pip<20.2 ; python_version < '3.10
Conda: Trying to install requirements:
['cudatoolkit=10.2']
Executing Conda: /opt/conda/bin/conda env update -p /root/.clearml/venvs-builds/3.10 --file /tmp/conda_envfps563o1.yml --quiet --json
2023-05-15 10:19:28
Pass
Conda: Installing requirements: step 2 - using pip:
['absl-py==1.4.0', 'attrs==23.1.0', 'cachetools==5.3.0', 'certifi==2023.5.7', 'charset-normalizer==3.1.0', 'clearml==1.10.4', 'Cython==0.29.34', 'furl==2.1.3', 'google-auth==2.18.0', 'google-auth-oauthlib==0.4.6', 'grpcio==1.54.0', 'idna==3.4', 'jsonschema==4.17.3', 'Markdown==3.4.3', 'MarkupSafe==2.1.2', 'numpy==1.24.3', 'nvidia-cublas-cu11==11.10.3.66', 'nvidia-cuda-nvrtc-cu11==11.7.99', 'nvidia-cuda-runtime-cu11==11.7.99', 'nvidia-cudnn-cu11==8.5.0.96', 'oauthlib==3.2.2', 'orderedmultidict==1.0.1', 'pathlib2==2.3.7.post1', 'Pillow==9.5.0', 'protobuf==3.20.3', 'psutil==5.9.5', 'pyasn1==0.5.0', 'pyasn1-modules==0.3.0', 'PyJWT==2.4.0', 'pyparsing==3.0.9', 'pyrsistent==0.19.3', 'python-dateutil==2.8.2', 'PyYAML==6.0', 'requests==2.30.0', 'requests-oauthlib==1.3.1', 'rsa==4.9', 'six==1.16.0', 'tensorboard==2.11.0', 'tensorboard-data-server==0.6.1', 'tensorboard-plugin-wit==1.8.1', 'torch==2.0.1', 'torchvision==2.0.1', 'torchaudio==2.0.1', 'typing_extensions==4.5.0', 'urllib3==1.26.15', 'Werkzeug==2.3.4']
Looking in indexes: https://artifacts.local/repository/pypi/simple
Collecting Cython==0.29.34
  Downloading https://artifacts.local/repository/pypi/packages/cython/0.29.34/Cython-0.29.34-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.9/1.9 MB 84.5 MB/s eta 0:00:00
?25hInstalling collected packages: Cython
Successfully installed Cython-0.29.34
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://artifacts.local/repository/pypi/simple
Collecting numpy==1.24.3
  Downloading https://artifacts.local/repository/pypi/packages/numpy/1.24.3/numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 64.5 MB/s eta 0:00:00
?25hInstalling collected packages: numpy
2023-05-15 10:19:33
Successfully installed numpy-1.24.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
jkhenning commented 1 year ago

Hi @stephanbertl, we actually use Nexus too, both internally and externally, and do not experience any issues, so I do not think this is specific to Nexus and ClearML; it is more likely something related to your particular setup.

Actually, from the log this looks more like an agent connectivity issue with the ClearML server than something related to the installation process (correct me if I'm wrong, but it seems the installation succeeds and then it hangs, right?).

I would suggest running the agent with the --debug flag to get more visibility.
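
For example, something like (same daemon invocation as above, with the debug flag added):

clearml-agent --debug daemon --queue GPU --foreground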

stephanbertl commented 1 year ago

Thanks for the hint; we found some more interesting logs now.

It tries to connect to download.pytorch.org, even though pip is configured to use only the Nexus proxy.

I understand that ClearML rewrites the torch requirement to match the CUDA environment, right?

We cannot proxy the download.pytorch.org repository with Nexus because it is not PEP 503 compliant. See https://github.com/pytorch/pytorch/issues/25639


DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:53:59
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:54:40
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:55:20
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:56:00
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:56:40
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
2023-05-15 14:57:20
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): download.pytorch.org:443
stephanbertl commented 1 year ago

An update: we did manage to proxy download.pytorch.org.

However, clearml-agent still cannot resolve the torch wheel correctly:

Package(s) not found: torch
clearml_agent: Warning: could not resolve python wheel replacement for torch==2.0.1
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pytorch.py", line 517, in replace
    new_req = self._replace(req)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pytorch.py", line 559, in _replace
    six.raise_from(exc, None)
  File "<string>", line 3, in raise_from
clearml_agent.helper.package.pytorch.PytorchResolutionError: Could not find pytorch wheel URL for: torch==2.0.1 with cuda 117 support

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 3020, in install_requirements_for_package_api
    package_api.load_requirements(cached_requirements)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pip_api/venv.py", line 40, in load_requirements
    requirements["pip"] = self.requirements_manager.replace(requirements["pip"])
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 630, in replace
    new_requirements = tuple(replace_one(i, req) for i, req in enumerate(parsed_requirements))
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 630, in <genexpr>
    new_requirements = tuple(replace_one(i, req) for i, req in enumerate(parsed_requirements))
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 621, in replace_one
    return self._replace_one(req)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/requirements.py", line 607, in _replace_one
    return handler.replace(req)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/helper/package/pytorch.py", line 524, in replace
    raise PytorchResolutionError("{}: {}".format(message, e))
clearml_agent.helper.package.pytorch.PytorchResolutionError: Exception when trying to resolve python wheel: Could not find pytorch wheel URL for: torch==2.0.1 with cuda 117 support

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/__main__.py", line 87, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/__main__.py", line 83, in main
    return run_command(parser, args, command_name)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/__main__.py", line 46, in run_command
    return func(**args_dict)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/base.py", line 63, in newfunc
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 2495, in execute
    self.install_requirements(
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 2971, in install_requirements
    return self.install_requirements_for_package_api(execution, repo_info, requirements_manager,
  File "/usr/local/lib/python3.10/dist-packages/clearml_agent/commands/worker.py", line 3024, in install_requirements_for_package_api
    raise ValueError("Could not install task requirements!\n{}".format(e))
ValueError: Could not install task requirements!

Can someone have a look and tell us what is going on? Installing torch manually from inside the Docker container works just fine. The image used is based on nvidia/cuda:11.7.1-cudnn8-runtime-ubuntu22.04:

pip install torch==2.0.1+cu117 --index-url https://download.pytorch.org/whl/cu117

What is ClearML trying to do? Why can't it just use the agent.package_manager.extra_index_url parameters we provide?

What kind of lookup does it perform to find the PyTorch version? Nothing shows up in the debug log.

Non-Docker mode seems to work fine and downloads torch. Docker mode does not succeed in looking up PyTorch. Also, --debug does not seem to have any effect on the Python logs in Docker mode.
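
For reference, this is roughly what the extra index configuration looks like per the agent configuration dump above. Adding the CUDA-specific PyTorch index there is only an option if the agent host can reach download.pytorch.org directly (or through a raw proxy rather than a PEP 503 PyPI proxy); this is a sketch, not a fix confirmed in this thread:

agent {
    package_manager {
        extra_index_url: [
            "https://nexus.local/repository/pypi/simple"
            # possible addition, assuming download.pytorch.org (or a raw proxy of it)
            # is reachable from the agent host:
            # "https://download.pytorch.org/whl/cu117"
        ]
    }
}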