allegroai/clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Bug in clearml-server 1.10, apiserver port configuration ignored #191

Closed: qraleq closed this issue 11 months ago

qraleq commented 1 year ago

There is an issue with the latest version of clearml-server where the port configuration for the apiserver is overridden, so the server ends up using the wrong port. As a result, login fails because the web UI sends API requests to the incorrect port. Despite mapping the apiserver to host port 10008, the browser's debug console shows that requests are still being sent to the wrong port. The problem does not occur when the image version is explicitly pinned to 1.9.

services:

  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
    ports:
    - "10008:8008"
    networks:
      - backend
      - frontend

How it looks in the debug console:

(screenshot of the browser debug console showing API requests sent to the wrong port)

jkhenning commented 1 year ago

Hi @qraleq, I'm not sure about 1.9, but unless the webserver is explicitly configured to communicate with the apiserver on the correct port, I don't think changing this line (- "10008:8008" in the apiserver service) in the docker compose would work for 1.9 either...

Are you sure this is the only change you've made when using 1.9?
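
Note: in a default deployment the browser normally reaches the API through the webserver's /api path rather than through the apiserver's host port mapping directly, which is also the pattern the working 1.9 URL later in this thread follows. A minimal sketch for checking both routes from the docker host (assuming the port mappings from the compose file above, and that the /api prefix is proxied through to the apiserver; debug.ping is the health endpoint already used by the agent-services entrypoint):

# direct apiserver host mapping (host 10008 -> container 8008)
curl http://localhost:10008/debug.ping

# route the web UI normally takes: through the webserver, under the /api prefix
curl http://localhost:10080/api/debug.ping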

qraleq commented 1 year ago

Hi @jkhenning, this was the only change I made.


version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:1.9
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
      CLEARML__services__async_urls_delete__enabled: "true"
    ports:
    - "10008:8008"
    networks:
      - backend
      - frontend

  elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
      ELASTIC_PASSWORD: ${ELASTIC_PASSWORD}
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: clearml
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.7
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs

  fileserver:
    networks:
      - backend
      - frontend
    command:
    - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:1.9
    environment:
      CLEARML__fileserver__delete__allow_batch: "true"
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
    ports:
    - "10081:8081"

  mongo:
    networks:
      - backend
    container_name: clearml-mongo
    image: mongo:4.4.9
    restart: unless-stopped
    command: --setParameter internalQueryMaxBlockingSortMemoryUsageBytes=196100200
    volumes:
    - /opt/clearml/data/mongo_4/db:/data/db
    - /opt/clearml/data/mongo_4/configdb:/data/configdb

  redis:
    networks:
      - backend
    container_name: clearml-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /opt/clearml/data/redis:/data

  webserver:
    command:
    - webserver
    container_name: clearml-webserver
    # environment:
    #  CLEARML_SERVER_SUB_PATH : clearml-web # Allow Clearml to be served with a URL path prefix.
    image: allegroai/clearml:1.9
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "10080:80"
    networks:
      - backend
      - frontend

  async_delete:
    depends_on:
      - apiserver
      - redis
      - mongo
      - elasticsearch
      - fileserver
    container_name: async_delete
    image: allegroai/clearml:1.9
    networks:
      - backend
    restart: unless-stopped
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_ELASTIC_SERVICE_PASSWORD: ${ELASTIC_PASSWORD}
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      PYTHONPATH: /opt/clearml/apiserver
      CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
    entrypoint:
      - python3
      - -m
      - jobs.async_urls_delete
      - --fileserver-host
      - http://fileserver:8081
    volumes:
      - /opt/clearml/logs:/var/log/clearml

  agent-services:
    networks:
      - backend
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    deploy:
      restart_policy:
        condition: on-failure
    privileged: true
    environment:
      CLEARML_HOST_IP: XXXXXXXXXXX
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
      CLEARML_API_ACCESS_KEY: XXXXXXXXXXXXXXXX
      CLEARML_API_SECRET_KEY: XXXXXXXXXXXXXXXX
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/clearml/agent:/root/.clearml
    depends_on:
      - apiserver
    entrypoint: >
      bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused 'http://apiserver:8008/debug.ping' && /usr/agent/entrypoint.sh"

networks:
  backend:
    driver: bridge
  frontend:
    driver: bridge
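
As a sanity check on this file, the host-port mappings the running containers actually expose can be listed with the container names defined above (a sketch):

docker-compose ps
docker port clearml-apiserver   # expected: 8008/tcp -> 0.0.0.0:10008
docker port clearml-webserver   # expected: 80/tcp -> 0.0.0.0:10080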
jkhenning commented 1 year ago

Oh, but you changed it in more than one service, right?

qraleq commented 1 year ago

Yes, that's true!

jkhenning commented 1 year ago

Also, are you sure you do not have any specific configuration (in your /opt/clearml/config folder, perhaps?) for the webserver?

qraleq commented 1 year ago

This is config/apiserver.conf, where I only removed the user info. That's the only change I made to this file.


    watch: false            # Watch for changes (dev only)
    debug: false            # Debug mode
    pretty_json: false      # prettify json response
    return_stack: true      # return stack trace on error
    return_stack_to_caller: true # top-level control on whether to return stack trace in an API response

    # if 'return_stack' is true and error contains a status code, return stack trace only for these status codes
    # valid values are:
    #  - an integer number, specifying a status code
    #  - a tuple of (code, subcode or list of subcodes)
    return_stack_on_code: [
        [500, 0]  # raise on internal server error with no subcode
    ]

    listen {
        ip : "0.0.0.0"
        port: 8008
    }

    version {
        required: false
        default: 1.0
        # if set then calls to endpoints with the version
        # greater than the current max version will be rejected
        check_max_version: false
    }

    pre_populate {
        enabled: false
        zip_files: ["/path/to/export.zip"]
        fail_on_error: false
        # artifacts_path: "/mnt/fileserver"
    }

    # time in seconds to take an exclusive lock to init es and mongodb
    # not including the pre_populate
    db_init_timout: 120

    mongo {
        # controls whether FieldDoesNotExist exception will be raised for any extra attribute existing in stored data
        # but not declared in a data model
        strict: false

        aggregate {
            allow_disk_use: true
        }
    }

    elastic {
        probing {
            # settings for initial probing of the elastic connection
            max_retries: 4
            timeout: 30
        }
        upgrade_monitoring {
            v16_migration_verification: true
        }
    }

    auth {
        # verify user tokens
        verify_user_tokens: false

        # max token expiration timeout in seconds (1 year)
        max_expiration_sec: 31536000

        # default token expiration timeout in seconds (30 days)
        default_expiration_sec: 2592000

        # cookie containing auth token, for requests arriving from a web-browser
        session_auth_cookie_name: "clearml_token_basic"

        # cookie configuration for authorization cookies generated by auth.login
        cookies {
            httponly: false  # allow only http to access the cookies (no JS etc)
            secure: true   # not using HTTPS
            domain: null    # Limit to localhost is not supported
            max_age: 99999999999
        }

        # provide a cookie domain override per company
#        cookies_domain_override {
#            <company-id>: <domain>
#        }

        fixed_users {
            enabled: true
            pass_hashed: true
            users: [

            ]
        }

    }

    cors {
        origins: "*"

        # Not supported when origins is "*"
        supports_credentials: true
    }

    default_company: "d1bd92a3b039400cbafc60a7a5b1e52b"

    workers {
        # Auto-register unknown workers on status reports and other calls
        auto_register: true
        # Assume unknown workers have unregistered (i.e. do not raise an unregistered error)
        auto_unregister: true
        # Timeout in seconds on task status update. If exceeded
        # then task can be stopped without communicating to the worker
        task_update_timeout: 600
    }

    check_for_updates {
        enabled: true

        # Check for updates every 24 hours
        check_interval_sec: 86400

        url: "https://updates.clear.ml/updates"

        component_name: "clearml-server"

        # GET request timeout
        request_timeout_sec: 3.0
    }

    statistics {
        # Note: statistics are sent ONLY if the user has actively opted-in
        supported: true

        url: "https://updates.clear.ml/stats"

        report_interval_hours: 24
        agent_relevant_threshold_days: 30

        max_retries: 5
        max_backoff_sec: 5
    }

}
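
As an aside, the CLEARML__ environment variables in the compose file map onto keys in this file, with double underscores separating the path segments (for example, CLEARML__apiserver__pre_populate__enabled corresponds to pre_populate.enabled above). A hypothetical override of the listen port, following that same convention, could be set as an environment variable on the apiserver container:

# hypothetical override (not part of the original setup); maps to the listen { port } key above
CLEARML__apiserver__listen__port=8008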
jkhenning commented 1 year ago

OK, that doesn't seem related. Let me check with the guys 🙂

jkhenning commented 1 year ago

Do you have a setup with 1.9 that's still working? If so, can you check which URL the webapp uses to make these calls?

qraleq commented 1 year ago

Yes, I have a v1.9.0 server running without any issues; the requests are sent to https://ADDRESS:9080/api/v2.23/login.supported_modes, where 9080 is the nginx-proxied port.

jkhenning commented 1 year ago

Why 9080 when the setting is 10080?

qraleq commented 1 year ago

This is the nginx config:


server {

    listen 9080;
    server_name XXXXX;

    ssl_certificate           /etc/letsencrypt/live/XXXXX/fullchain.pem;
    ssl_certificate_key       /etc/letsencrypt/live/XXXXX/privkey.pem;

    ssl on;
    ssl_session_cache  builtin:1000  shared:SSL:10m;
    ssl_protocols  TLSv1 TLSv1.1 TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!eNULL:!EXPORT:!CAMELLIA:!DES:!MD5:!PSK:!RC4;
    ssl_session_tickets on;
    ssl_session_timeout 8h;

    access_log            /var/log/nginx/clearml_app.access.log;
    error_log            /var/log/nginx/clearml_app.error.log;

    location / {

      proxy_set_header        Host $host;
      proxy_set_header        X-Real-IP $remote_addr;
      proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header        X-Forwarded-Proto $scheme;

      # Fix the "It appears that your reverse proxy setup is broken" error.
      proxy_pass          http://localhost:10080;
      proxy_read_timeout  90;

      proxy_ssl_server_name on;

      proxy_redirect      https://localhost:10080 https://ml.forsight.ai:10080;
    }
  }
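
A quick way to confirm the proxy path end to end is to request the same login URL the working 1.9 webapp uses, through the nginx port (a sketch; XXXXX stands for the actual server name in the config above):

# 9080 (nginx) -> 10080 (clearml-webserver) -> apiserver via the /api prefix
curl -s https://XXXXX:9080/api/v2.23/login.supported_modes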
jkhenning commented 1 year ago

Is this something you changed as well? Or is it your own nginx running on top of the ClearML server?

qraleq commented 1 year ago

This is my own nginx running on top of the ClearML server. It works fine with version 1.9.0.

jkhenning commented 1 year ago

@qraleq we've identified the issue and will publish a patch release today or tomorrow 🙂

qraleq commented 1 year ago

That's great 👍🏽 Do you mind sharing what the issue is?

oren-allegro commented 1 year ago

Hi @qraleq. This was a change that was inserted by mistake, causing the webserver to use the apiserver's default port directly instead of going through the reverse proxy via the /api URL. We fixed the issue and will release a version shortly.

qraleq commented 1 year ago

Perfect, thank you for the explanation and fast fix! Best regards!

oren-allegro commented 1 year ago

@qraleq - version 1.10.1 was just released and should fix this. Please pull the new image (https://github.com/allegroai/clearml-server#upgrading-) and let us know if it fixes your issue.
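
For reference, a minimal upgrade sketch along the lines of the linked instructions (assuming the compose file is in the current directory and the data under /opt/clearml is backed up first):

docker-compose down
# edit docker-compose.yml: point the allegroai/clearml images at 1.10.1 (or latest)
docker-compose pull
docker-compose up -d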

jkhenning commented 11 months ago

@qraleq closing this. Please reopen if required.