Closed: abfshaal closed this issue 1 year ago
Hi @abfshaal, how are you running the agent? Are you referring to the agent-services service, or another agent running in a separate docker container?
The error you describe seems to indicate the agent simply can't get to the server and is stuck waiting for the connection to be established...
Hi @jkhenning, I am running the agent with this command on my local machine:
clearml-agent daemon --cpu-only --docker python:3.9-bullseye --queue default --foreground
It is not in a docker container; it is just an agent running in docker mode instead of virtualenv mode.
The ClearML server is deployed via the YAML file shared. Both the agent and the ClearML deployment are on the same machine.
The log shared is from:
clearml-agent daemon --cpu-only --docker python:3.9-bullseye --queue default --foreground
However, the log only gets stuck at that point when I enqueue from the local ClearML deployment. If I go to app.clear.ml and run the same example experiment (after editing clearml.conf) with the same clearml-agent command, it runs fine. This makes me think there is an issue with the docker-compose deployment. What do you think?
@abfshaal how is the clearml.conf file configured for this agent?
# ClearML SDK configuration file
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server: http://localhost:8008
web_server: http://localhost:8080
files_server: http://localhost:8081
# Credentials are generated using the webapp, http://localhost:8080/settings
# Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
credentials {"access_key": "****", "secret_key": "****"}
}
# api {
# # Abdulraheem Sha'al's workspace
# web_server: https://app.clear.ml
# api_server: https://api.clear.ml
# files_server: https://files.clear.ml
# credentials {
# "access_key" = "****"
# "secret_key" = "****"
# }
# }
sdk {
# ClearML - default SDK configuration
storage {
cache {
# Defaults to system temp folder / cache
default_base_dir: "~/.clearml/cache"
# default_cache_manager_size: 100
}
direct_access: [
# Objects matching are considered to be available for direct access, i.e. they will not be downloaded
# or cached, and any download request will return a direct reference.
# Objects are specified in glob format, available for url and content_type.
{ url: "file://*" } # file-urls are always directly referenced
]
}
metrics {
# History size for debug files per metric/variant. For each metric/variant combination with an attached file
# (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
# X files are stored in the upload destination for each metric/variant combination.
file_history_size: 100
# Max history size for matplotlib imshow files per plot title.
# File names for the uploaded images will be recycled in such a way that no more than
# X images are stored in the upload destination for each matplotlib plot title.
matplotlib_untitled_history_size: 100
# Limit the number of digits after the dot in plot reporting (reducing plot report size)
# plot_max_num_digits: 5
# Settings for generated debug images
images {
format: JPEG
quality: 87
subsampling: 0
}
# Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
tensorboard_single_series_per_graph: false
}
network {
# Number of retries before failing to upload file
file_upload_retries: 3
metrics {
# Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
# a specific iteration
file_upload_threads: 4
# Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
# being sent for upload
file_upload_starvation_warning_sec: 120
}
iteration {
# Max number of retries when getting frames if the server returned an error (http code 500)
max_retries_on_server_error: 5
# Backoff factor for consecutive retry attempts.
# SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
retry_backoff_factor_sec: 10
}
}
aws {
s3 {
# S3 credentials, used for read/write access by various SDK elements
# The following settings will be used for any bucket not specified below in the "credentials" section
# ---------------------------------------------------------------------------------------------------
region: ""
# Specify explicit keys
key: ""
secret: ""
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
# Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
extra_args: {}
# ---------------------------------------------------------------------------------------------------
credentials: [
# specifies key/secret credentials to use when handling s3 urls (read or write)
# {
# bucket: "my-bucket-name"
# key: "my-access-key"
# secret: "my-secret-key"
# },
# {
# # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
# host: "my-minio-host:9000"
# key: "12345678"
# secret: "12345678"
# multipart: false
# secure: false
# }
]
}
boto3 {
pool_connections: 512
max_multipart_concurrency: 16
}
}
google.storage {
# # Default project and credentials file
# # Will be used when no bucket configuration is found
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# pool_connections: 512
# pool_maxsize: 1024
# # Specific credentials per bucket and sub directory
# credentials = [
# {
# bucket: "my-bucket"
# subdir: "path/in/bucket" # Not required
# project: "clearml"
# credentials_json: "/path/to/credentials.json"
# },
# ]
}
azure.storage {
# max_connections: 2
# containers: [
# {
# account_name: "clearml"
# account_key: "secret"
# # container_name:
# }
# ]
}
log {
# debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
null_log_propagate: false
task_log_buffer_capacity: 66
# disable urllib info and lower levels
disable_urllib3_info: true
}
development {
# Development-mode options
# dev task reuse window
task_reuse_time_window_in_hours: 72.0
# Run VCS repository detection asynchronously
vcs_repo_detect_async: true
# Store uncommitted git/hg source code diff in experiment manifest when training in development mode
# This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
store_uncommitted_code_diff: true
# Support stopping an experiment in case it was externally stopped, status was changed or task was reset
support_stopping: true
# Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
default_output_uri: ""
# Default auto generated requirements optimize for smaller requirements
# If True, analyze the entire repository regardless of the entry point.
# If False, first analyze the entry point script; if it does not contain references to other local files,
# do not analyze the entire repository.
force_analyze_entire_repo: false
# If set to true, *clearml* update message will not be printed to the console
# this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
suppress_update_message: false
# If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
detect_with_pip_freeze: false
# Log specific environment variables. OS environments are listed in the "Environment" section
# of the Hyper-Parameters.
# multiple selected variables are supported including the suffix '*'.
# For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
# This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
# Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
log_os_environments: []
# Development mode worker
worker {
# Status report period in seconds
report_period_sec: 2
# The number of events to report
report_event_flush_threshold: 100
# ping to the server - check connectivity
ping_period_sec: 30
# Log all stdout & stderr
log_stdout: true
# Carriage return (\r) support. If zero (0), \r is treated as \n and flushed to the backend.
# Carriage return flush support in seconds: flush consecutive carriage returns (\r) every X (default: 10) seconds
console_cr_flush_period: 10
# compatibility feature, report memory usage for the entire machine
# default (false), report only on the running process and its sub-processes
report_global_mem_used: false
}
}
# Apply top-level environment section from configuration into os.environ
apply_environment: false
# Top-level environment section is in the form of:
# environment {
# key: value
# ...
# }
# and is applied to the OS environment as `key=value` for each key/value pair
# Apply top-level files section from configuration into local file system
apply_files: false
# Top-level files section allows auto-generating files at designated paths with a predefined contents
# and target format. Options include:
# contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
# format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
# base64-encoded contents string, otherwise ignored
# path: the target file's path, may include ~ and inplace env vars
# target_format: format used to encode contents before writing into the target file. Supported values are json,
# yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
# overwrite: overwrite the target file in case it exists. Default is true.
#
# Example:
# files {
# myfile1 {
# contents: "The quick brown fox jumped over the lazy dog"
# path: "/tmp/fox.txt"
# }
# myjsonfile {
# contents: {
# some {
# nested {
# value: [1, 2, 3, 4]
# }
# }
# }
# path: "/tmp/test.json"
# target_format: json
# }
# }
}
At the start you will find the two server configurations I switch between: one for the local deployment and one for app.clear.ml.
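(Side note: instead of commenting blocks in and out, the endpoints and credentials can also be overridden per shell session with environment variables. The CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY names are the ones referenced in the config comments above; the *_HOST names are the standard ClearML override variables. A sketch for the local deployment:)

# Point the SDK/agent at the local server for this shell session
export CLEARML_API_HOST=http://localhost:8008
export CLEARML_WEB_HOST=http://localhost:8080
export CLEARML_FILES_HOST=http://localhost:8081
# Credentials generated in the local webapp (placeholders here)
export CLEARML_API_ACCESS_KEY=<local-access-key>
export CLEARML_API_SECRET_KEY=<local-secret-key>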
And can you do curl http://localhost:8008/debug.ping -u "<key>:<secret>" from the same machine? (key and secret are the values you have in the API section)
I get this result:
{"meta":{"id":"1b92f40a693b4ecbbff201cfc17911df","trx":"1b92f40a693b4ecbbff201cfc17911df","endpoint":{"name":"debug.ping","requested_version":"2.24","actual_version":"1.0"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":"","error_data":{}},"data":{"msg":"ClearML server"}}
Update: it seems the issue happens when the agent is on the same machine as the ClearML server deployment. I deployed the ClearML server on a virtual Linux machine, started the agent on my local machine, and things worked fine.
When both are on the same machine, it feels as if the ClearML deployment and the docker container can't communicate with each other for some reason.
Could this have to do with the usage of localhost inside clearml.conf? I also tried adding the argument --network=clearml_backend, as I thought the network could be the issue here, but no luck there either.
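(A quick way to check the network side while debugging; the clearml_backend name is taken from the compose file mentioned above and may differ in other deployments:)

# List the networks docker-compose created for the deployment
docker network ls

# Show which containers are attached to the backend network, and its subnet
docker network inspect clearml_backend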
Did you run the curl request from the same machine that has the server deployment? (That's what I meant for you to do when I asked about it.)
Yup, sounds right. If I go into the docker container that gets launched upon queueing a task, the ping command returns: curl: (7) Failed to connect to localhost port 8008: Connection refused
Yeah, so something's in the way...
Figured it out! To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it, as this gives the docker container that gets launched access to localhost services.
Do you think this is something that can be added to the documentation? I feel like I have seen at least 2 other issues relating to the same thing. Feel free to close the issue as resolved. Many thanks for your help @jkhenning
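(A note for anyone who wants this to stick without editing the daemon command: the same flag can be set in the agent section of clearml.conf. extra_docker_arguments is the option name from the default agent config template; this is a sketch, not the full agent section:)

agent {
    # Arguments passed verbatim to every `docker run` the agent launches for a task.
    # --network=host makes the task container share the host's network stack,
    # so localhost:8008 inside the container reaches the api server on the host.
    extra_docker_arguments: ["--network=host"]
}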
Oh right, missed that 🙂 I'll see what we can do to add that 👍
I had the same issue and solved it thanks to @abfshaal. It seems nothing about this has been added to the docs yet. I would also suggest adding a hint about this, because running everything on the same machine is not uncommon, especially for a newbie trying out ClearML.
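(For anyone landing here, a quick way to see the difference the flag makes, assuming the server from this thread on localhost:8008 and using curlimages/curl as a throwaway image whose entrypoint is curl; host networking behaves this way on Linux, while Docker Desktop support varies:)

# Default bridge network: localhost resolves to the container itself, so this is refused
docker run --rm curlimages/curl -sS http://localhost:8008/debug.ping

# Host network: the container shares the host's network stack, so this reaches the api server
docker run --rm --network=host curlimages/curl -sS http://localhost:8008/debug.ping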
Describe the bug
I am trying to create a self-hosted ClearML server, with a docker agent on the same machine. When I try to enqueue a task, the runner gets stuck indefinitely at this step:
Running Docker: Executing: ['docker', 'run', '-t', '-v', '/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners:/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners', '-e', 'SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners', '-l', 'clearml-worker-id=AshaalL02:cpu:0', '-l', 'clearml-parent-worker-id=AshaalL02:cpu:0', '-e', 'CLEARML_WORKER_ID=AshaalL02:cpu:0', '-e', 'CLEARML_DOCKER_IMAGE=python:3.9-bullseye', '-e', 'CLEARML_TASK_ID=5fc9dfa25cd44f9790bbb8df0d2e7b23', '-v', '/Users/abdulraheemshaal/.gitconfig:/root/.gitconfig', '-v', '/var/folders/xm/27jjjrp13y9bq3657smh4c780000gp/T/.clearml_agent.yuogvi0z.cfg:/tmp/clearml.conf', '-e', 'CLEARML_CONFIG_FILE=/tmp/clearml.conf', '-v', '/Users/abdulraheemshaal/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/Users/abdulraheemshaal/.clearml/pip-cache:/root/.cache/pip', '-v', '/Users/abdulraheemshaal/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/Users/abdulraheemshaal/.clearml/cache:/clearml_agent_cache', '-v', '/Users/abdulraheemshaal/.clearml/vcs-cache:/root/.clearml/vcs-cache', '-v', '/Users/abdulraheemshaal/.clearml/venvs-cache:/root/.clearml/venvs-cache', '--rm', 'python:3.9-bullseye', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; [ ! -z $LOCAL_PYTHON ] || for i in {15..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update -y ; apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2 ; python_version < \'3.10\'" "pip<22.3 ; python_version >= \'3.10\'" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; echo \'we reached here\' ; cp /tmp/clearml.conf ~/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=none $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 5fc9dfa25cd44f9790bbb8df0d2e7b23']

I do check if there is a docker instance running with docker ps, and I do see one with its logs stuck at:

pip 22.0.4 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)
Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:2 http://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://deb.debian.org/debian bullseye/main arm64 Packages [8072 kB]
Get:5 http://deb.debian.org/debian-security bullseye-security/main arm64 Packages [233 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main arm64 Packages [12.0 kB]
Fetched 8525 kB in 3s (2594 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libglib2.0-0 is already the newest version (2.66.8-1).
libglib2.0-0 set to manually installed.
libsm6 is already the newest version (2:1.2.3-1).
libsm6 set to manually installed.
libxext6 is already the newest version (2:1.3.3-1.1).
libxext6 set to manually installed.
libxrender-dev is already the newest version (1:0.9.10-1).
libxrender-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
Ignoring pip: markers 'python_version >= "3.10"' don't match your environment
Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-20.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting clearml-agent
  Using cached clearml_agent-1.5.2-py3-none-any.whl (401 kB)
Collecting jsonschema<5.0.0,>=2.6.0
  Using cached jsonschema-4.17.3-py3-none-any.whl (90 kB)
Collecting attrs<23.0.0,>=18.0
  Using cached attrs-22.2.0-py3-none-any.whl (60 kB)
Processing /root/.cache/pip/wheels/74/d1/7d/d9ae7d9aea0f1cebed73f37868df7b5f3333e7f30163b3e558/psutil-5.9.5-cp39-abi3-linux_aarch64.whl
Collecting python-dateutil<2.9.0,>=2.4.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pyjwt<2.7.0,>=2.4.0
  Using cached PyJWT-2.6.0-py3-none-any.whl (20 kB)
Collecting pyparsing<3.1.0,>=2.0.3
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting PyYAML<6.1,>=3.12
  Using cached PyYAML-6.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (731 kB)
Collecting pathlib2<2.4.0,>=2.3.0
  Using cached pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting virtualenv<21,>=16
  Using cached virtualenv-20.22.0-py3-none-any.whl (3.2 MB)
Collecting furl<2.2.0,>=2.0.0
  Using cached furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting requests<2.29.0,>=2.20.0
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
Collecting urllib3<1.27.0,>=1.21.1
  Using cached urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting six<1.17.0,>=1.13.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0
  Using cached pyrsistent-0.19.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (117 kB)
Collecting distlib<1,>=0.3.6
  Using cached distlib-0.3.6-py2.py3-none-any.whl (468 kB)
Collecting filelock<4,>=3.11
  Using cached filelock-3.12.0-py3-none-any.whl (10 kB)
Collecting platformdirs<4,>=3.2
  Using cached platformdirs-3.2.0-py3-none-any.whl (14 kB)
Collecting orderedmultidict>=1.0.1
  Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (196 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2022.12.7-py3-none-any.whl (155 kB)
Installing collected packages: attrs, pyrsistent, jsonschema, psutil, six, python-dateutil, pyjwt, pyparsing, PyYAML, pathlib2, distlib, filelock, platformdirs, virtualenv, orderedmultidict, furl, idna, charset-normalizer, urllib3, certifi, requests, clearml-agent
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.2.0 psutil-5.9.5 pyjwt-2.6.0 pyparsing-3.0.9 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.28.2 six-1.16.0 urllib3-1.26.15 virtualenv-20.22.0
WARNING: You are using pip version 20.1.1; however, version 23.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
If I add and execute a custom script, it executes it and then hangs.
I tried the same thing with a local docker agent against the clearml app, and it worked fine. The issue is happening with my self-hosted deployment.
This is the docker-compose file for the deployment I am using:
I also tried to run the runner as a sudo user; it did not change the outcome. I am completely stuck on this.
To reproduce
Create a local deployment with Ubuntu or macOS.
Clone any experiment and enqueue it.
Create an agent with docker configuration.
Expected behaviour
I expect it to run the enqueued task instead of getting stuck, the same way it does with the clearml app.
Environment