allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Tasks fail when running an agent inside Notebook 2: Remote Agent #1204

Open sobek1886 opened 4 months ago

sobek1886 commented 4 months ago

Describe the bug

Hi. I'm going through the getting started Colab notebooks. I ran into an issue when trying to execute tasks using an agent set up through the 2nd tutorial notebook.

I start the agent using

!clearml-agent daemon --queue "default" --foreground

Then, every time I enqueue a task, it fails. Example terminal output:

Executing task id [bfd4a833c8ba497794ecda6b4332b3aa]:
repository = 
branch = 
version_num = 
tag = 
docker_cmd = 
entry_point = colab_kernel_launcher.py
working_dir = .

::: Using Cached environment /root/.clearml/venvs-cache/6323bc2138a003ed039770fdc7da9483.b0b672ff6952198853ffbbe536f4e324 :::

Adding venv into cache: /root/.clearml/venvs-builds/3.10
Running task id [bfd4a833c8ba497794ecda6b4332b3aa]:
[.]$ /root/.clearml/venvs-builds/3.10/bin/python -u /root/.clearml/venvs-builds/3.10/code/colab_kernel_launcher.py
Summary - installed python packages:
pip:
- asttokens==2.4.1
- attrs==23.2.0
- cachetools==5.3.2
- certifi==2024.2.2
- charset-normalizer==3.3.2
- clearml==1.14.3
- Cython==3.0.8
- decorator==5.1.1
- exceptiongroup==1.2.0
- executing==2.0.1
- furl==2.1.3
- google-api-core==2.17.0
- google-auth==2.27.0
- google-cloud-core==2.4.1
- google-cloud-storage==2.8.0
- google-crc32c==1.5.0
- google-resumable-media==2.7.0
- googleapis-common-protos==1.62.0
- idna==3.6
- ipykernel==5.5.6
- ipython==8.21.0
- ipython-genutils==0.2.0
- jedi==0.19.1
- jsonschema==4.21.1
- jsonschema-specifications==2023.12.1
- jupyter_client==8.6.0
- jupyter_core==5.7.1
- matplotlib-inline==0.1.6
- numpy==1.26.4
- orderedmultidict==1.0.1
- parso==0.8.3
- pathlib2==2.3.7.post1
- pexpect==4.9.0
- pillow==10.2.0
- platformdirs==4.2.0
- prompt-toolkit==3.0.43
- protobuf==4.25.2
- psutil==5.9.8
- ptyprocess==0.7.0
- pure-eval==0.2.2
- pyasn1==0.5.1
- pyasn1-modules==0.3.0
- Pygments==2.17.2
- PyJWT==2.8.0
- pyparsing==3.1.1
- python-dateutil==2.8.2
- PyYAML==6.0.1
- pyzmq==23.2.1
- referencing==0.33.0
- requests==2.31.0
- rpds-py==0.18.0
- rsa==4.9
- six==1.16.0
- stack-data==0.6.3
- tornado==6.4
- traitlets==5.14.1
- urllib3==2.2.0
- wcwidth==0.2.13

Environment setup completed successfully

Starting Task Execution:

[ColabKernelApp] CRITICAL | Bad config encountered during initialization: The 'kernel_class' trait of <__main__.ColabKernelApp object at 0x7c5aa3978160> instance must be a type, but 'google.colab._kernel.Kernel' could not be imported

Leaving process id 3941
DONE: Running task 'bfd4a833c8ba497794ecda6b4332b3aa', exit status 1
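
The CRITICAL line above says that `google.colab._kernel.Kernel` could not be imported, and the installed-packages summary does not appear to list a `google-colab` package in the agent's virtual environment. A minimal sketch for checking whether a dotted module path is importable from a given interpreter follows; the helper name `can_import` is hypothetical and not part of ClearML or Colab.

```python
import importlib.util


def can_import(module_name: str) -> bool:
    """Return True if the current interpreter can locate `module_name`."""
    try:
        # find_spec imports parent packages, so a missing parent raises
        # ModuleNotFoundError (a subclass of ImportError).
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


if __name__ == "__main__":
    # Module path taken from the error message in the log above.
    print(can_import("google.colab._kernel"))
```

Running this with the interpreter the agent created (`/root/.clearml/venvs-builds/3.10/bin/python` in the log) would show whether the Colab kernel module is visible inside that environment.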

Expected behaviour

I expected the tasks to execute successfully.

Environment

Jonasmpi commented 4 months ago

Having the same issue here following the tutorial for remote colab agents

tkukurin commented 4 months ago

The problem is the diff you probably have in your task. I sent #1220 to hopefully fix it.

This is the diff ("uncommitted changes") you'll probably see if you open the task in your ClearML project:

# Copyright 2023 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Custom kernel launcher app to customize socket options."""

from ipykernel import kernelapp
import zmq

# We want to set the high water mark on *all* sockets to 0, as we don't want
# the backend dropping any messages. We want to set this before any calls to
# bind or connect.
#
# In principle we should override `init_sockets`, but it's hard to set options
# on the `zmq.Context` there without rewriting the entire method. Instead we
# settle for only setting this on `iopub`, as that's the most important for our
# use case.
class ColabKernelApp(kernelapp.IPKernelApp):

  def init_iopub(self, context):
    context.setsockopt(zmq.RCVHWM, 0)
    context.setsockopt(zmq.SNDHWM, 0)
    return super().init_iopub(context)

if __name__ == '__main__':
  ColabKernelApp.launch_instance()
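
The diff above is Colab's own kernel launcher, and the agent log reports `entry_point = colab_kernel_launcher.py`, so the task seems to have recorded the launcher rather than the notebook's code. As a rough sketch, a check like the one below could flag such tasks before they are enqueued. The function name `is_colab_launcher_task` is hypothetical, and it works on plain strings copied from the task's execution details rather than calling any ClearML API.

```python
import os

# Entry point reported in the agent log for the failing task.
COLAB_LAUNCHER = "colab_kernel_launcher.py"


def is_colab_launcher_task(entry_point: str, diff: str = "") -> bool:
    """Heuristic: True when a task recorded Colab's launcher instead of user code."""
    if os.path.basename(entry_point.strip()) == COLAB_LAUNCHER:
        return True
    # The recorded diff shown above ends by launching this class.
    return "ColabKernelApp.launch_instance()" in diff
```

This is only a heuristic based on the two strings visible in this thread. It does not repair the task.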

Jonasmpi commented 4 months ago

On my side the issue also occurred with normal projects; it was just publicly reproducible with the tutorial as well. I will try again tomorrow.

pollfly commented 1 month ago

Hey @sobek1886! Just letting you know that this issue has been resolved in v1.15.0. Let us know if there are any issues :)
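
To confirm whether an environment already has the fix, the installed version can be compared against 1.15.0. The sketch below assumes that "v1.15.0" refers to the `clearml` Python package (the log above shows `clearml==1.14.3`). The helper names are hypothetical, and the parser only handles plain `major.minor.patch` strings, ignoring any suffix after the leading digits.

```python
from importlib import metadata
from itertools import takewhile

FIXED_IN = (1, 15, 0)  # release named in the comment above


def parse_version(text: str) -> tuple:
    """Parse 'major.minor.patch' into a tuple of ints, padding with zeros."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def has_fix(installed: str) -> bool:
    """True if `installed` is at least the release that contains the fix."""
    return parse_version(installed) >= FIXED_IN


def installed_clearml_has_fix() -> bool:
    """Check the locally installed package; False if clearml is not installed."""
    try:
        return has_fix(metadata.version("clearml"))
    except metadata.PackageNotFoundError:
        return False
```

If `installed_clearml_has_fix()` returns `False`, upgrading the package in the notebook with `pip install -U clearml` before re-running the tutorial is the obvious next step.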