allegroai / clearml-agent

ClearML Agent - ML-Ops made easy. ML-Ops scheduler & orchestration solution
https://clear.ml/docs/
Apache License 2.0

Run in a docker mode not passing envs (DIND) #186

Open david-nano opened 8 months ago

david-nano commented 8 months ago

This is my docker-compose.yaml file:

version: '3.8'

services:
  clearml-agent:
    image: allegroai/clearml-agent:latest
    container_name: clearml-agent
    environment:
      - CLEARML_AGENT_GIT_USER=dcdevops
      - CLEARML_AGENT_GIT_PASS=***
      - CLEARML_API_ACCESS_KEY=***
      - CLEARML_API_SECRET_KEY=***
      - CLEARML_API_HOST=***
      - CLEARML_WEB_HOST=***
      - CLEARML_FILES_HOST=***
      - CLEARML_WORKER_NAME="local-agent"
      - CLEARML_AGENT_EXTRA_DOCKER_ARGS="-v /root/clearml.conf:/root/clearml.conf:ro -v /root/.ssh:/root/.ssh:ro"
      - CLEARML_DOCKER_IMAGE="python:3.8-slim"
      - CLEARML_AGENT__AGENT__DEFAULT_DOCKER__IMAGE="python:3.8-slim"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /home/dcdevops/clearml-agent/clearml.conf:/root/clearml.conf:ro
      - /home/dcdevops/clearml-agent/clearml.conf:/opt/clearml/agent.default.conf:ro
      - /home/dcdevops/.ssh/:/root/.ssh/:ro

But when I try to send jobs, I get this error:

docker: invalid reference format.
See 'docker run --help'.

It seems the line that triggers the container is what breaks it, since it contains:

'-e', 'CLEARML_DOCKER_IMAGE=',

This could be what causes the break. None of the environment variables I passed affect it, so I tried changing clearml.conf and adding an agent section:

agent {
    # Set GIT user/pass credentials (if user/pass are set, GIT protocol will be set to https)
    git_user=""
    git_pass=""
    # all other domains will use public access (no user/pass). Default: always send user/pass for any VCS domain
    git_host="https://gitlab.domain.local/"

    # Force GIT protocol to use SSH regardless of the git url (Assumes GIT user/pass are blank)
    force_git_ssh_protocol: true

    # unique name of this worker, if None, created based on hostname:process_id
    # Overridden with os environment: CLEARML_WORKER_NAME
    worker_id: "MyHost"
    extra_docker_arguments: ["-v /root/clearml.conf:/root/clearml.conf:ro", "-v /root/.ssh:/root/.ssh:ro" ]
    docker_allow_host_environ: true
    default_docker {
      image: "python:3.8-slim"
    }
}

But I still get the same result.

  1. How can I set the image here and provide this environment variable?
  2. Do we need this environment variable at all? This is the container that gets triggered, so why does it need the variable?
  3. I have another issue with reading clearml.conf in the child docker container: how do we make sure it is passed on, or how can I add an argument to the child docker to make sure -v clearml.conf:clearml.conf is passed?
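
One thing that may be worth checking, sketched here under stated assumptions: in the list form of `environment:`, each entry is a single YAML string, so double quotes written after the `=` are generally kept as part of the value rather than stripped. If that holds for this setup, the agent would see the image as `"python:3.8-slim"` including the quote characters, which would not be a valid image reference. The Python snippet below is only an illustration. `split_entry` and `looks_like_image_ref` are hypothetical helpers, and the regex is a simplified approximation, not Docker's actual reference grammar.

```python
import re

# Entries as a YAML parser would hand them over: each list item under
# `environment:` is one plain string, so inner double quotes are ordinary
# characters (assumption: no further unquoting happens afterwards).
entries = [
    'CLEARML_WORKER_NAME="local-agent"',
    'CLEARML_DOCKER_IMAGE="python:3.8-slim"',
    'CLEARML_DOCKER_IMAGE=python:3.8-slim',
]


def split_entry(entry):
    """Split a KEY=VALUE entry on the first '=' without touching quotes."""
    key, _, value = entry.partition("=")
    return key, value


# Simplified stand-in for an image reference check (hypothetical, not
# Docker's full grammar): lowercase name components plus an optional tag.
SIMPLE_REF = re.compile(
    r"^[a-z0-9]+(?:[._/-][a-z0-9]+)*(?::[A-Za-z0-9_][A-Za-z0-9_.-]*)?$"
)


def looks_like_image_ref(value):
    """Return True if the value passes the simplified reference check."""
    return bool(SIMPLE_REF.match(value))


if __name__ == "__main__":
    for entry in entries:
        key, value = split_entry(entry)
        print(key, repr(value), looks_like_image_ref(value))
```

If this turns out to be the cause, one thing to try would be dropping the inner quotes in the compose file, for example `- CLEARML_DOCKER_IMAGE=python:3.8-slim`.
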

jkhenning commented 8 months ago

Hi @david-nano,

It seems your docker-compose version does not support the file format; you'll need to upgrade it.

david-nano commented 8 months ago

> It seems your docker-compose version does not support the file format; you'll need to upgrade it.

How is this related to docker-compose? The agent's compose file uses the latest file format version (3.8), and the agent itself runs the ClearML image, allegroai/clearml-agent:latest, so I don't have control over it.