allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Docker-Agent Stuck #994

Closed. abfshaal closed this issue 1 year ago.

abfshaal commented 1 year ago

Describe the bug

I am trying to set up a self-hosted ClearML deployment, with a Docker-mode agent running on the same machine. When I enqueue a task, the runner gets stuck indefinitely on this step:

Running Docker: Executing: ['docker', 'run', '-t', '-v', '/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners:/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners', '-e', 'SSH_AUTH_SOCK=/private/tmp/com.apple.launchd.4wa3OgXMUn/Listeners', '-l', 'clearml-worker-id=AshaalL02:cpu:0', '-l', 'clearml-parent-worker-id=AshaalL02:cpu:0', '-e', 'CLEARML_WORKER_ID=AshaalL02:cpu:0', '-e', 'CLEARML_DOCKER_IMAGE=python:3.9-bullseye', '-e', 'CLEARML_TASK_ID=5fc9dfa25cd44f9790bbb8df0d2e7b23', '-v', '/Users/abdulraheemshaal/.gitconfig:/root/.gitconfig', '-v', '/var/folders/xm/27jjjrp13y9bq3657smh4c780000gp/T/.clearml_agent.yuogvi0z.cfg:/tmp/clearml.conf', '-e', 'CLEARML_CONFIG_FILE=/tmp/clearml.conf', '-v', '/Users/abdulraheemshaal/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/Users/abdulraheemshaal/.clearml/pip-cache:/root/.cache/pip', '-v', '/Users/abdulraheemshaal/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/Users/abdulraheemshaal/.clearml/cache:/clearml_agent_cache', '-v', '/Users/abdulraheemshaal/.clearml/vcs-cache:/root/.clearml/vcs-cache', '-v', '/Users/abdulraheemshaal/.clearml/venvs-cache:/root/.clearml/venvs-cache', '--rm', 'python:3.9-bullseye', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL libsm6 libxext6 libxrender-dev libglib2.0-0" ; [ ! -z $(which git) ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL git" ; declare LOCAL_PYTHON ; [ ! -z $LOCAL_PYTHON ] || for i in {15..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || export CLEARML_APT_INSTALL="$CLEARML_APT_INSTALL python3-pip" ; [ -z "$CLEARML_APT_INSTALL" ] || (apt-get update -y ; apt-get install -y $CLEARML_APT_INSTALL) ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2 ; python_version < \'3.10\'" "pip<22.3 ; python_version >= \'3.10\'" ; $LOCAL_PYTHON -m pip install -U clearml-agent ; echo \'we reached here\' ; cp /tmp/clearml.conf ~/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=none $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 5fc9dfa25cd44f9790bbb8df0d2e7b23']

I check whether a Docker container is running with docker ps, and I do see one, with its logs stuck at:

pip 22.0.4 from /usr/local/lib/python3.9/site-packages/pip (python 3.9)
Get:1 http://deb.debian.org/debian bullseye InRelease [116 kB]
Get:2 http://deb.debian.org/debian-security bullseye-security InRelease [48.4 kB]
Get:3 http://deb.debian.org/debian bullseye-updates InRelease [44.1 kB]
Get:4 http://deb.debian.org/debian bullseye/main arm64 Packages [8072 kB]
Get:5 http://deb.debian.org/debian-security bullseye-security/main arm64 Packages [233 kB]
Get:6 http://deb.debian.org/debian bullseye-updates/main arm64 Packages [12.0 kB]
Fetched 8525 kB in 3s (2594 kB/s)
Reading package lists... Done
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libglib2.0-0 is already the newest version (2.66.8-1).
libglib2.0-0 set to manually installed.
libsm6 is already the newest version (2:1.2.3-1).
libsm6 set to manually installed.
libxext6 is already the newest version (2:1.3.3-1.1).
libxext6 set to manually installed.
libxrender-dev is already the newest version (1:0.9.10-1).
libxrender-dev set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
Ignoring pip: markers 'python_version >= "3.10"' don't match your environment
Collecting pip<20.2
  Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-20.1.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Collecting clearml-agent
  Using cached clearml_agent-1.5.2-py3-none-any.whl (401 kB)
Collecting jsonschema<5.0.0,>=2.6.0
  Using cached jsonschema-4.17.3-py3-none-any.whl (90 kB)
Collecting attrs<23.0.0,>=18.0
  Using cached attrs-22.2.0-py3-none-any.whl (60 kB)
Processing /root/.cache/pip/wheels/74/d1/7d/d9ae7d9aea0f1cebed73f37868df7b5f3333e7f30163b3e558/psutil-5.9.5-cp39-abi3-linux_aarch64.whl
Collecting python-dateutil<2.9.0,>=2.4.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting pyjwt<2.7.0,>=2.4.0
  Using cached PyJWT-2.6.0-py3-none-any.whl (20 kB)
Collecting pyparsing<3.1.0,>=2.0.3
  Using cached pyparsing-3.0.9-py3-none-any.whl (98 kB)
Collecting PyYAML<6.1,>=3.12
  Using cached PyYAML-6.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (731 kB)
Collecting pathlib2<2.4.0,>=2.3.0
  Using cached pathlib2-2.3.7.post1-py2.py3-none-any.whl (18 kB)
Collecting virtualenv<21,>=16
  Using cached virtualenv-20.22.0-py3-none-any.whl (3.2 MB)
Collecting furl<2.2.0,>=2.0.0
  Using cached furl-2.1.3-py2.py3-none-any.whl (20 kB)
Collecting requests<2.29.0,>=2.20.0
  Using cached requests-2.28.2-py3-none-any.whl (62 kB)
Collecting urllib3<1.27.0,>=1.21.1
  Using cached urllib3-1.26.15-py2.py3-none-any.whl (140 kB)
Collecting six<1.17.0,>=1.13.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting pyrsistent!=0.17.0,!=0.17.1,!=0.17.2,>=0.14.0
  Using cached pyrsistent-0.19.3-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (117 kB)
Collecting distlib<1,>=0.3.6
  Using cached distlib-0.3.6-py2.py3-none-any.whl (468 kB)
Collecting filelock<4,>=3.11
  Using cached filelock-3.12.0-py3-none-any.whl (10 kB)
Collecting platformdirs<4,>=3.2
  Using cached platformdirs-3.2.0-py3-none-any.whl (14 kB)
Collecting orderedmultidict>=1.0.1
  Using cached orderedmultidict-1.0.1-py2.py3-none-any.whl (11 kB)
Collecting idna<4,>=2.5
  Using cached idna-3.4-py3-none-any.whl (61 kB)
Collecting charset-normalizer<4,>=2
  Using cached charset_normalizer-3.1.0-cp39-cp39-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (196 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2022.12.7-py3-none-any.whl (155 kB)
Installing collected packages: attrs, pyrsistent, jsonschema, psutil, six, python-dateutil, pyjwt, pyparsing, PyYAML, pathlib2, distlib, filelock, platformdirs, virtualenv, orderedmultidict, furl, idna, charset-normalizer, urllib3, certifi, requests, clearml-agent
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7.post1 platformdirs-3.2.0 psutil-5.9.5 pyjwt-2.6.0 pyparsing-3.0.9 pyrsistent-0.19.3 python-dateutil-2.8.2 requests-2.28.2 six-1.16.0 urllib3-1.26.15 virtualenv-20.22.0
WARNING: You are using pip version 20.1.1; however, version 23.1 is available.
You should consider upgrading via the '/usr/local/bin/python3.9 -m pip install --upgrade pip' command.
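For readability, the bash -c payload the agent generates boils down to roughly this sequence (a paraphrase of the command above, not the literal script):

    # inside the task container
    apt-get update && apt-get install -y libsm6 libxext6 libxrender-dev libglib2.0-0 git   # system deps
    python3 -m pip install -U "pip<20.2 ; python_version < '3.10'" "pip<22.3 ; python_version >= '3.10'"
    python3 -m pip install -U clearml-agent
    python3 -u -m clearml_agent execute --disable-monitoring --id 5fc9dfa25cd44f9790bbb8df0d2e7b23   # the step that must reach the API server

The log above shows clearml-agent installed successfully, so the hang happens at the final execute step.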

If I add a custom script to execute, the agent executes it and then hangs.

I tried the same thing with a local Docker-mode agent against the hosted ClearML app, and it worked fine. The issue only happens with my self-hosted deployment.

This is the docker-compose file for the deployment I am using:

version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
    ports:
    - "8008:8008"
    networks:
      - backend
      - frontend

  elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: clearml
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.7
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs

  fileserver:
    networks:
      - backend
      - frontend
    command:
    - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:latest
    environment:
      CLEARML__fileserver__delete__allow_batch: "true"
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
    ports:
    - "8081:8081"

  mongo:
    networks:
      - backend
    container_name: clearml-mongo
    image: mongo:4.4.9
    restart: unless-stopped
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
    volumes:
    - /opt/clearml/data/mongo_4/db:/data/db
    - /opt/clearml/data/mongo_4/configdb:/data/configdb

  redis:
    networks:
      - backend
    container_name: clearml-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /opt/clearml/data/redis:/data

  webserver:
    command:
    - webserver
    container_name: clearml-webserver
    # environment:
    #  CLEARML_SERVER_SUB_PATH : clearml-web # Allow Clearml to be served with a URL path prefix.
    image: allegroai/clearml:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "8080:80"
    networks:
      - backend
      - frontend

  async_delete:
    depends_on:
      - apiserver
      - redis
      - mongo
      - elasticsearch
      - fileserver
    container_name: async_delete
    image: allegroai/clearml:latest
    networks:
      - backend
    restart: unless-stopped
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      PYTHONPATH: /opt/clearml/apiserver
      CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
    entrypoint:
      - python3
      - -m
      - jobs.async_urls_delete
      - --fileserver-host
      - http://fileserver:8081
    volumes:
      - /opt/clearml/logs:/var/log/clearml

  agent-services:
    networks:
      - backend
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    deploy:
      restart_policy:
        condition: on-failure
    privileged: true
    environment:
      CLEARML_HOST_IP: http://apiserver:8008
      CLEARML_WEB_HOST: http://webserver:8080
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_FILES_HOST: http://fileserver:8081
      CLEARML_API_ACCESS_KEY: 0N5ZH1KE4IP569EUBSFC
      CLEARML_API_SECRET_KEY: JulUROVcu94KiyLzGDFAQIYY2yYR8dcOnHTxUdikLthDs98oyk
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/clearml/agent:/root/.clearml
    depends_on:
      - apiserver
    entrypoint: >
      bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused 'http://apiserver:8008/debug.ping' && /usr/agent/entrypoint.sh"

networks:
  backend:
    driver: bridge
  frontend:
    driver: bridge

I also tried running the agent as a sudo user; it did not change the outcome. I am completely stuck on this.

To reproduce

Create a local deployment on Ubuntu or macOS. Create an agent with a Docker configuration. Clone any experiment and enqueue it.
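The commands involved are roughly these (the compose file name is an assumption on my part; the agent command is the one I use throughout this issue):

    # bring up the self-hosted server from the compose file shared above
    docker-compose -f docker-compose.yml up -d

    # start a Docker-mode agent listening on the default queue
    clearml-agent daemon --cpu-only --docker python:3.9-bullseye --queue default --foreground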

Expected behaviour

I expect it to run the enqueued task instead of getting stuck, the same way it does with the hosted ClearML app.

Environment

jkhenning commented 1 year ago

Hi @abfshaal , how are you running the agent? Are you referring to the agent-services service, or another agent running in a separate docker container? The error you describe seems to indicate the agent simply can't get to the server and is stuck waiting for the connection to be established...

abfshaal commented 1 year ago

Hi @jkhenning, I am running the agent with this command on my local machine:

clearml-agent daemon --cpu-only --docker python:3.9-bullseye --queue default --foreground

It is not in a Docker container; it is just an agent running with a Docker configuration instead of the venv/pyenv mode.

The ClearML server is deployed via the docker-compose file shared above.

Both the agent and the ClearML deployment are on the same machine.

The log shared above is from clearml-agent daemon --cpu-only --docker python:3.9-bullseye --queue default --foreground.

However, the log only gets stuck at that point when I enqueue from the local ClearML deployment. If I go to app.clear.ml and run the same example experiment (after editing clearml.conf) with the same clearml-agent command, it runs fine. That makes me think there is an issue with the docker-compose deployment. What do you think?

jkhenning commented 1 year ago

@abfshaal how is the clearml.conf file configured for this agent?

abfshaal commented 1 year ago
# ClearML SDK configuration file
api {
    # Notice: 'host' is the api server (default port 8008), not the web server.
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: http://localhost:8081
    # Credentials are generated using the webapp, http://localhost:8080/settings
    # Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
    credentials {"access_key": "****", "secret_key": "****"}
}
# api {
#   # Abdulraheem Sha'al's workspace
#   web_server: https://app.clear.ml
#   api_server: https://api.clear.ml
#   files_server: https://files.clear.ml
#   credentials {
#     "access_key" = "****"
#     "secret_key" = "****"
#   }
# }
sdk {
    # ClearML - default SDK configuration

    storage {
        cache {
            # Defaults to system temp folder / cache
            default_base_dir: "~/.clearml/cache"
            # default_cache_manager_size: 100
        }

        direct_access: [
            # Objects matching are considered to be available for direct access, i.e. they will not be downloaded
            # or cached, and any download request will return a direct reference.
            # Objects are specified in glob format, available for url and content_type.
            { url: "file://*" }  # file-urls are always directly referenced
        ]
    }

    metrics {
        # History size for debug files per metric/variant. For each metric/variant combination with an attached file
        # (e.g. debug image event), file names for the uploaded files will be recycled in such a way that no more than
        # X files are stored in the upload destination for each metric/variant combination.
        file_history_size: 100

        # Max history size for matplotlib imshow files per plot title.
        # File names for the uploaded images will be recycled in such a way that no more than
        # X images are stored in the upload destination for each matplotlib plot title.
        matplotlib_untitled_history_size: 100

        # Limit the number of digits after the dot in plot reporting (reducing plot report size)
        # plot_max_num_digits: 5

        # Settings for generated debug images
        images {
            format: JPEG
            quality: 87
            subsampling: 0
        }

        # Support plot-per-graph fully matching Tensorboard behavior (i.e. if this is set to true, each series should have its own graph)
        tensorboard_single_series_per_graph: false
    }

    network {
        # Number of retries before failing to upload file
        file_upload_retries: 3

        metrics {
            # Number of threads allocated to uploading files (typically debug images) when transmitting metrics for
            # a specific iteration
            file_upload_threads: 4

            # Warn about upload starvation if no uploads were made in specified period while file-bearing events keep
            # being sent for upload
            file_upload_starvation_warning_sec: 120
        }

        iteration {
            # Max number of retries when getting frames if the server returned an error (http code 500)
            max_retries_on_server_error: 5
            # Backoff factory for consecutive retry attempts.
            # SDK will wait for {backoff factor} * (2 ^ ({number of total retries} - 1)) between retries.
            retry_backoff_factor_sec: 10
        }
    }
    aws {
        s3 {
            # S3 credentials, used for read/write access by various SDK elements

            # The following settings will be used for any bucket not specified below in the "credentials" section
            # ---------------------------------------------------------------------------------------------------
            region: ""
            # Specify explicit keys
            key: ""
            secret: ""
            # Or enable credentials chain to let Boto3 pick the right credentials. 
            # This includes picking credentials from environment variables, 
            # credential file and IAM role using metadata service. 
            # Refer to the latest Boto3 docs
            use_credentials_chain: false
            # Additional ExtraArgs passed to boto3 when uploading files. Can also be set per-bucket under "credentials".
            extra_args: {}
            # ---------------------------------------------------------------------------------------------------

            credentials: [
                # specifies key/secret credentials to use when handling s3 urls (read or write)
                # {
                #     bucket: "my-bucket-name"
                #     key: "my-access-key"
                #     secret: "my-secret-key"
                # },
                # {
                #     # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                #     host: "my-minio-host:9000"
                #     key: "12345678"
                #     secret: "12345678"
                #     multipart: false
                #     secure: false
                # }
            ]
        }
        boto3 {
            pool_connections: 512
            max_multipart_concurrency: 16
        }
    }
    google.storage {
        # # Default project and credentials file
        # # Will be used when no bucket configuration is found
        # project: "clearml"
        # credentials_json: "/path/to/credentials.json"
        # pool_connections: 512
        # pool_maxsize: 1024

        # # Specific credentials per bucket and sub directory
        # credentials = [
        #     {
        #         bucket: "my-bucket"
        #         subdir: "path/in/bucket" # Not required
        #         project: "clearml"
        #         credentials_json: "/path/to/credentials.json"
        #     },
        # ]
    }
    azure.storage {
        # max_connections: 2

        # containers: [
        #     {
        #         account_name: "clearml"
        #         account_key: "secret"
        #         # container_name:
        #     }
        # ]
    }

    log {
        # debugging feature: set this to true to make null log propagate messages to root logger (so they appear in stdout)
        null_log_propagate: false
        task_log_buffer_capacity: 66

        # disable urllib info and lower levels
        disable_urllib3_info: true
    }

    development {
        # Development-mode options

        # dev task reuse window
        task_reuse_time_window_in_hours: 72.0

        # Run VCS repository detection asynchronously
        vcs_repo_detect_async: true

        # Store uncommitted git/hg source code diff in experiment manifest when training in development mode
        # This stores "git diff" or "hg diff" into the experiment's "script.requirements.diff" section
        store_uncommitted_code_diff: true

        # Support stopping an experiment in case it was externally stopped, status was changed or task was reset
        support_stopping: true

        # Default Task output_uri. if output_uri is not provided to Task.init, default_output_uri will be used instead.
        default_output_uri: ""

        # Default auto generated requirements optimize for smaller requirements
        # If True, analyze the entire repository regardless of the entry point.
        # If False, first analyze the entry point script; if it does not reference other local files,
        # do not analyze the entire repository.
        force_analyze_entire_repo: false

        # If set to true, *clearml* update message will not be printed to the console
        # this value can be overwritten with os environment variable CLEARML_SUPPRESS_UPDATE_MESSAGE=1
        suppress_update_message: false

        # If this flag is true (default is false), instead of analyzing the code with Pigar, analyze with `pip freeze`
        detect_with_pip_freeze: false

        # Log specific environment variables. OS environments are listed in the "Environment" section
        # of the Hyper-Parameters.
        # multiple selected variables are supported including the suffix '*'.
        # For example: "AWS_*" will log any OS environment variable starting with 'AWS_'.
        # This value can be overwritten with os environment variable CLEARML_LOG_ENVIRONMENT="[AWS_*, CUDA_VERSION]"
        # Example: log_os_environments: ["AWS_*", "CUDA_VERSION"]
        log_os_environments: []

        # Development mode worker
        worker {
            # Status report period in seconds
            report_period_sec: 2

            # The number of events to report
            report_event_flush_threshold: 100

            # ping to the server - check connectivity
            ping_period_sec: 30

            # Log all stdout & stderr
            log_stdout: true

            # Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
            # Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
            console_cr_flush_period: 10

            # compatibility feature, report memory usage for the entire machine
            # default (false), report only on the running process and its sub-processes
            report_global_mem_used: false
        }
    }

    # Apply top-level environment section from configuration into os.environ
    apply_environment: false
    # Top-level environment section is in the form of:
    #   environment {
    #     key: value
    #     ...
    #   }
    # and is applied to the OS environment as `key=value` for each key/value pair

    # Apply top-level files section from configuration into local file system
    apply_files: false
    # Top-level files section allows auto-generating files at designated paths with a predefined contents
    # and target format. Options include:
    #  contents: the target file's content, typically a string (or any base type int/float/list/dict etc.)
    #  format: a custom format for the contents. Currently supported value is `base64` to automatically decode a
    #          base64-encoded contents string, otherwise ignored
    #  path: the target file's path, may include ~ and inplace env vars
    #  target_format: format used to encode contents before writing into the target file. Supported values are json,
    #                 yaml, yml and bytes (in which case the file will be written in binary mode). Default is text mode.
    #  overwrite: overwrite the target file in case it exists. Default is true.
    #
    # Example:
    #   files {
    #     myfile1 {
    #       contents: "The quick brown fox jumped over the lazy dog"
    #       path: "/tmp/fox.txt"
    #     }
    #     myjsonfile {
    #       contents: {
    #         some {
    #           nested {
    #             value: [1, 2, 3, 4]
    #           }
    #         }
    #       }
    #       path: "/tmp/test.json"
    #       target_format: json
    #     }
    #   }
}

At the start you will find the two server configurations I switch between: one for the local deployment and one for app.clear.ml.
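The conf comments above mention overriding the credentials via environment variables; I believe the endpoints can be switched the same way instead of editing the file (my assumption is that these variables take precedence over clearml.conf):

    # point the SDK/agent at the local deployment for one shell session
    export CLEARML_API_HOST=http://localhost:8008
    export CLEARML_WEB_HOST=http://localhost:8080
    export CLEARML_FILES_HOST=http://localhost:8081
    export CLEARML_API_ACCESS_KEY=<access-key>
    export CLEARML_API_SECRET_KEY=<secret-key>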

jkhenning commented 1 year ago

And can you do curl http://localhost:8008/debug.ping -u "<key>:<secret>" from the same machine? (key and secret are the values you have in the API section)

abfshaal commented 1 year ago

I get this result:

{"meta":{"id":"1b92f40a693b4ecbbff201cfc17911df","trx":"1b92f40a693b4ecbbff201cfc17911df","endpoint":{"name":"debug.ping","requested_version":"2.24","actual_version":"1.0"},"result_code":200,"result_subcode":0,"result_msg":"OK","error_stack":"","error_data":{}},"data":{"msg":"ClearML server"}}

abfshaal commented 1 year ago

Update: it seems the issue happens when the agent is on the same machine as the ClearML server deployment. I deployed the ClearML server on a virtual Linux machine and started the agent on my local machine, and things worked fine.

When both are on the same machine, it seems the ClearML deployment and the launched Docker container can't communicate with each other for some reason.

Could this have to do with the use of localhost inside clearml.conf? I also tried adding the argument --network=clearml_backend, as I thought the network could be the issue here, but no luck either.
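For reference, this is roughly how I would sanity-check that route (assuming the compose project is named clearml, so the created bridge network is clearml_backend; curlimages/curl is just a convenient throwaway image). On that network the API server is reachable by its service name, not by localhost:

    # from the host: attach a throwaway container to the compose network
    docker run --rm --network=clearml_backend curlimages/curl http://apiserver:8008/debug.ping

So joining the network alone would not help while clearml.conf still points the task at http://localhost:8008.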

jkhenning commented 1 year ago

Did you run the curl request from the same machine that hosts the server deployment? (That's what I meant when I asked about it.)

abfshaal commented 1 year ago

Yup, that sounds right. If I go into the Docker container that gets launched upon enqueueing a task, the ping command returns:

curl: (7) Failed to connect to localhost port 8008: Connection refused
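For anyone reproducing this, the check looks roughly like the following (the container ID is whatever docker ps shows; host.docker.internal is a Docker Desktop convenience on macOS/Windows, mentioned here as an assumption rather than something the agent sets up):

    docker ps                                             # find the task container launched by the agent
    docker exec -it <container-id> bash
    curl http://localhost:8008/debug.ping                 # fails: localhost is the container itself
    curl http://host.docker.internal:8008/debug.ping      # reaches the host on Docker Desktop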

jkhenning commented 1 year ago

Yeah, so something's in the way...

abfshaal commented 1 year ago

Figured it out! To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it, as this gives the launched Docker container access to services on localhost.
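For a persistent setup, my understanding (worth double-checking against your clearml.conf template) is that the same flag can go into the agent section of clearml.conf, so every container the agent launches gets it:

    agent {
        # appended to every docker run the agent issues; assumes agent.extra_docker_arguments
        # is supported by your clearml-agent version
        extra_docker_arguments: ["--network=host"]
    }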

abfshaal commented 1 year ago

Do you think this is something that could be added to the documentation? I feel like I have seen at least two other issues relating to the same thing. Feel free to close the issue as resolved. Many thanks for your help @jkhenning

jkhenning commented 1 year ago

Oh right, missed that 🙂 I'll see what we can do to add that 👍

jokokojote commented 1 year ago

I had the same issue and solved it thanks to @abfshaal. It seems nothing about this has been added to the docs yet. I would also suggest adding a hint about it, because running everything on the same machine is not uncommon, especially when trying ClearML as a newbie.

ainoam commented 1 year ago

Thanks for the note @jokokojote - would you care to push a PR?