allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

trains-agent: how does it work? #64

Open sdesrozis opened 3 years ago

sdesrozis commented 3 years ago

Hi, I've deployed a Trains server with Docker, using the docker-compose.yml script. It works perfectly and it is awesome.

Now I would like to use the cleanup service, but I can't get it to work at the moment. First, I had to fix proxy issues by adding some http_proxy env vars. Second, I had to hard-code TRAINS_API_HOST in the compose file because something seemed to go wrong otherwise. Now I get the following error log from Docker:

http://10.4.0.10:8081 http://10.4.0.10:8080 http://10.4.0.10:8008
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.

trains_agent: ERROR: Failed getting token (error 504 from http://10.4.0.10:8008): Gateway Timeout

http://10.4.0.10:8081 http://10.4.0.10:8080 http://10.4.0.10:8008
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.

trains_agent: ERROR: Failed getting token (error 504 from http://10.4.0.10:8008): Gateway Timeout

http://10.4.0.10:8081 http://10.4.0.10:8080 http://10.4.0.10:8008
WARNING: You are using pip version 20.1.1; however, version 20.2.3 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
Failed creating temporary copy of ~/.ssh for git credential

What did I miss? Help is welcome! Thanks in advance.

NB: trains-server comes from the GitHub sources.

jkhenning commented 3 years ago

Hi @sdesrozis,

It looks like the Trains Agent simply can't reach the Trains Server apiserver component (using port 8008). Generating a token is the first thing the Trains Agent does when starting to communicate with the server. Is 10.4.0.10 the value you've set up for the TRAINS_API_HOST env var? Can you perhaps share the docker-compose file?

sdesrozis commented 3 years ago

Sorry for the delay...

My compose file:

version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: trains-apiserver
    image: allegroai/trains:latest
    restart: unless-stopped
    volumes:
    - /opt/trains/logs:/var/log/trains
    - /opt/trains/config:/opt/trains/config
    - /opt/trains/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      TRAINS_ELASTIC_SERVICE_HOST: elasticsearch
      TRAINS_ELASTIC_SERVICE_PORT: 9200
      TRAINS_MONGODB_SERVICE_HOST: mongo
      TRAINS_MONGODB_SERVICE_PORT: 27017
      TRAINS_REDIS_SERVICE_HOST: redis
      TRAINS_REDIS_SERVICE_PORT: 6379
      TRAINS_SERVER_DEPLOYMENT_TYPE: ${TRAINS_SERVER_DEPLOYMENT_TYPE:-linux}
      TRAINS__apiserver__pre_populate__enabled: "true"
      TRAINS__apiserver__pre_populate__zip_files: "/opt/trains/db-pre-populate"
      TRAINS__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      HTTP_PROXY: http://irproxy:8082
      HTTPS_PROXY: http://irproxy:8082
      http_proxy: http://irproxy:8082
      https_proxy: http://irproxy:8082
      no_proxy: .ifp.fr,127.0.0.1,apiserver,.ifpen.fr,digitalsandbox
      NO_PROXY: .ifp.fr,127.0.0.1,apiserver,.ifpen.fr,digitalsandbox
    ports:
    - "8008:8008"
    networks:
      - backend

  elasticsearch:
    networks:
      - backend
    container_name: trains-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: trains
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: trains
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
    restart: unless-stopped
    volumes:
    - /opt/trains/data/elastic_7:/usr/share/elasticsearch/data
    ports:
    - "9200:9200"

  fileserver:
    networks:
      - backend
    command:
    - fileserver
    container_name: trains-fileserver
    image: allegroai/trains:latest
    restart: unless-stopped
    volumes:
    - /opt/trains/logs:/var/log/trains
    - /opt/trains/data/fileserver:/mnt/fileserver
    - /opt/trains/config:/opt/trains/config
    ports:
    - "8081:8081"

  mongo:
    networks:
      - backend
    container_name: trains-mongo
    image: mongo:3.6.5
    restart: unless-stopped
    command: --setParameter internalQueryExecMaxBlockingSortBytes=196100200
    volumes:
    - /opt/trains/data/mongo/db:/data/db
    - /opt/trains/data/mongo/configdb:/data/configdb
    ports:
    - "27017:27017"

  redis:
    networks:
      - backend
    container_name: trains-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /opt/trains/data/redis:/data
    ports:
    - "6379:6379"

  webserver:
    command:
    - webserver
    container_name: trains-webserver
    image: allegroai/trains:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "8080:80"

  agent-services:
    networks:
      - backend
    container_name: trains-agent-services
    image: allegroai/trains-agent-services:latest
    restart: unless-stopped
    privileged: true
    environment:
      TRAINS_HOST_IP: ${TRAINS_HOST_IP}
      TRAINS_WEB_HOST: ${TRAINS_WEB_HOST:-}
      TRAINS_API_HOST: http://10.4.0.10:8008
      TRAINS_FILES_HOST: ${TRAINS_FILES_HOST:-}
      TRAINS_API_ACCESS_KEY: ${TRAINS_API_ACCESS_KEY:-}
      TRAINS_API_SECRET_KEY: ${TRAINS_API_SECRET_KEY:-}
      TRAINS_AGENT_GIT_USER: ${TRAINS_AGENT_GIT_USER}
      TRAINS_AGENT_GIT_PASS: ${TRAINS_AGENT_GIT_PASS}
      TRAINS_AGENT_UPDATE_VERSION: ${TRAINS_AGENT_UPDATE_VERSION:->=0.15.0}
      TRAINS_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      TRAINS_WORKER_ID: "trains-services"
      TRAINS_AGENT_DOCKER_HOST_MOUNT: "/opt/trains/agent:/root/.trains"
      HTTP_PROXY: http://irproxy:8082
      HTTPS_PROXY: http://irproxy:8082
      http_proxy: http://irproxy:8082
      https_proxy: http://irproxy:8082
      no_proxy: .ifp.fr,127.0.0.1,apiserver,.ifpen.fr,digitalsandbox
      NO_PROXY: .ifp.fr,127.0.0.1,apiserver,.ifpen.fr,digitalsandbox
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/trains/agent:/root/.trains
    depends_on:
      - apiserver

networks:
  backend:
    driver: bridge

TRAINS_HOST_IP is set to 10.4.0.10.

Thanks for your help. I don't see what I'm doing wrong...
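
One thing that may be worth ruling out in the compose file above: agent-services sets http_proxy, but its no_proxy list does not contain 10.4.0.10, and a 504 Gateway Timeout is the kind of status a proxy returns when it cannot reach the target. The sketch below is only an illustration of that idea. `bypasses_proxy` is a hypothetical helper, and its suffix matching is a simplification (real clients differ in how they interpret no_proxy entries).

```python
def bypasses_proxy(host: str, no_proxy: str) -> bool:
    """Return True if `host` matches an entry of a comma-separated no_proxy list.

    Simplified matching (an assumption, not any specific client's rules):
    an entry matches the host itself or any subdomain of it; "*" matches all.
    """
    host = host.lower()
    for entry in no_proxy.split(","):
        entry = entry.strip().lower()
        if not entry:
            continue
        if entry == "*":
            return True
        suffix = entry.lstrip(".")
        if host == suffix or host.endswith("." + suffix):
            return True
    return False


# no_proxy value copied from the agent-services section of the compose file
NO_PROXY = ".ifp.fr,127.0.0.1,apiserver,.ifpen.fr,digitalsandbox"

for candidate in ("apiserver", "10.4.0.10"):
    print(candidate, "bypasses proxy:", bypasses_proxy(candidate, NO_PROXY))
```

Under these simplified rules, `apiserver` is exempt from the proxy while `10.4.0.10` is not, so a request to http://10.4.0.10:8008 from the agent container could end up going through irproxy.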

jkhenning commented 3 years ago

Hi @sdesrozis,

From the shell (after running the server) - can you reach http://10.4.0.10:8008?

For example, what is the response you get when you try curl -XGET http://10.4.0.10:8008?

sdesrozis commented 3 years ago

That doesn't look good:

{"meta":{"id":"3e1894edc488431895229ca3c13ce419","trx":"3e1894edc488431895229ca3c13ce419","endpoint":{"name":"","requested_version":1.0,"actual_version":null},"result_code":400,"result_subcode":0,"result_msg":"Invalid request path /","error_stack":null},"data":{}}

but http://10.4.0.10:8080 and others are reachable.

The Docker stack looks fine:

a01bd59217f5        allegroai/trains-agent-services:latest                "/usr/agent/entrypoi…"   6 days ago          Up 6 days                                                           trains-agent-services
6efcb4143380        allegroai/trains:latest                               "/opt/trains/wrapper…"   6 days ago          Up 6 days           8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp   trains-webserver
3235a05ddc02        allegroai/trains:latest                               "/opt/trains/wrapper…"   6 days ago          Up 6 days           0.0.0.0:8008->8008/tcp, 8080-8081/tcp           trains-apiserver
1e6a354ecc4f        mongo:3.6.5                                           "docker-entrypoint.s…"   6 days ago          Up 6 days           0.0.0.0:27017->27017/tcp                        trains-mongo
1eb82bc66ea7        redis:5.0                                             "docker-entrypoint.s…"   6 days ago          Up 6 days           0.0.0.0:6379->6379/tcp                          trains-redis
ffe8da4d4932        docker.elastic.co/elasticsearch/elasticsearch:7.6.2   "/usr/local/bin/dock…"   6 days ago          Up 6 days           0.0.0.0:9200->9200/tcp, 9300/tcp                trains-elastic
a470dc715e7a        allegroai/trains:latest                               "/opt/trains/wrapper…"   6 days ago          Up 6 days           8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp      trains-fileserver

jkhenning commented 3 years ago

The response looks good, actually (it's an error, but the one I'd expect 🙂).

However, this means that the same address is not reachable from within the agent-services container...
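
A minimal sketch of how the same check could be repeated from inside the agent-services container, assuming `python3` is available there (the agent log refers to `/usr/bin/python3`). `probe` is a hypothetical helper name. It deliberately ignores any http_proxy/https_proxy environment variables, so the result reflects direct connectivity only.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> str:
    """Try a direct GET to `url` and describe the outcome as a short string."""
    # An empty ProxyHandler disables proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return f"reachable (HTTP {resp.status})"
    except urllib.error.HTTPError as err:
        # The server answered, even if with an error status such as 400.
        return f"reachable (HTTP {err.code})"
    except (urllib.error.URLError, OSError) as err:
        # No HTTP answer at all: refused connection, timeout, routing problem...
        return f"unreachable ({err})"


# Example call, to be run inside the container:
#   print(probe("http://10.4.0.10:8008"))
```

An HTTP 400 here would match the "Invalid request path /" answer seen from the host shell, meaning the apiserver is reachable directly. An "unreachable" result would point at networking between the container and the host address rather than at the apiserver itself.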

sdesrozis commented 3 years ago

I'll check as soon as possible whether the curl command works inside the container.

EDIT: it could be a collision between the IP ranges provided by IT and the ones Docker uses. Let's see.