allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Self Hosted ClearMl Server - No option to create credentials, Web UI seems faulty #72

Open chrisorm opened 3 years ago

chrisorm commented 3 years ago

Hi, I followed the instructions on the install page and installed using Docker on Ubuntu. All steps worked and no errors were reported. I installed it today, so presumably it is a recent version, but I'm not sure how to tell the exact version for the server.

If I access the profile page I see:

[image: screenshot of the profile page]

Unlike the ClearML-hosted web UI pages, there are no navigation options either, however I access the web UI (for example, no user icon at the top right and no nav bar down the left-hand side).

Any ideas?

jkhenning commented 3 years ago

Hi @chrisorm, this looks like the WebApp can't reach the server. Can you open the developer tools panel (F12), go to the network section, reload the page and share what appears in the network section list?

amedyukhina commented 3 years ago

Hi, I am having the same problem.

In the developer tools panel, I am getting the following errors:

POST <server_address>:8080/api/v2.12/login.supported_modes 502 (Bad Gateway)  zone.js:2843
POST <server_address>:8080/api/v2.12/users.get_preferences 502 (Bad Gateway)  zone.js:2843
POST <server_address>:8080/api/v2.12/users.get_current_user 502 (Bad Gateway)  zone.js:2843

jkhenning commented 3 years ago

Hi @amedyukhina,

The error (502 Bad Gateway) indicates that the web server on port 8080 received the request but could not get a valid response from the API server behind it.

Where is your server running? Are you running the web server on the same machine, or on a machine on the same network?
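
A 502 is produced by a gateway or proxy, so the console lines above show that something answered on port 8080 and then failed to get a reply from the service behind it (presumably the API server). As a rough illustration, the Python sketch below tallies the status codes in pasted console lines and reports whether every failing call was a 502. The function names and the regular expression are hypothetical helpers written for this thread, not part of ClearML.

```python
import re
from collections import Counter

# Matches browser console lines such as:
#   POST <server_address>:8080/api/v2.12/users.get_preferences 502 (Bad Gateway)  zone.js:2843
CONSOLE_LINE = re.compile(r"^(?P<method>[A-Z]+)\s+(?P<url>\S+)\s+(?P<status>\d{3})\b")


def summarize_console_errors(lines):
    """Count the HTTP status codes found in browser console error lines."""
    counts = Counter()
    for line in lines:
        match = CONSOLE_LINE.match(line.strip())
        if match:
            counts[int(match.group("status"))] += 1
    return counts


def all_bad_gateway(lines):
    """True when at least one line parsed and every parsed line is a 502."""
    counts = summarize_console_errors(lines)
    return bool(counts) and set(counts) == {502}
```

If `all_bad_gateway` returns True for every API call, the next thing worth checking is whether the API server itself is up, which is where the thread goes next.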

hadyan-tvlk commented 3 years ago

Facing a similar issue, @jkhenning.

~I'm trying to deploy the platform on my local laptop. I'm following this tutorial exactly: https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_linux_mac.html#deploying~

~Did I miss anything here? Should we modify the docker-compose.yml file? Thanks in advance!~

Somehow, after redeploying the server, it works well. Not sure why.

jkhenning commented 3 years ago

Well, it might be that the apiserver component took some time to boot, and the UI simply could not reach it.

chrisorm commented 3 years ago

Sorry for the delay - in my case, as it was running inside a VM, I think it was taking a long time to start. Increasing the VM's RAM and restarting fixed the issue.
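
If slow startup is the suspected cause, one way to confirm it is to poll the API server port until it accepts connections instead of reloading the page by hand. The following is a minimal Python sketch under that assumption. `wait_for_port` is a hypothetical helper, the timeout and interval values are arbitrary, and port 8008 is taken from the `curl http://localhost:8008` check used in this thread.

```python
import socket
import time


def wait_for_port(host, port, timeout=120.0, interval=2.0):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect only proves that something is listening,
            # not that the service behind the port is fully initialised.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 8008)` run on the server machine returns True once the port is open, or False if nothing starts listening within two minutes.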

amedyukhina commented 3 years ago

Hi @jkhenning,

I was accessing from a different network via VPN. This used to work before, but I had to reinstall the clearML server after a system upgrade (from RHEL7 to RHEL8, if this is important).

I have also tried to access the server from the same machine it is running on, and I get the same error.

jkhenning commented 3 years ago

@amedyukhina did you try curl http://localhost:8008 from the same machine? What's the output?

amedyukhina commented 3 years ago

It says: curl: (7) Failed to connect to localhost port 8008: Connection refused. It seems like nothing is running there.

jkhenning commented 3 years ago

This is from the server machine? If so, it indeed indicates the server is not up.

Can you do sudo docker ps? I assume you're using docker-compose to run the server?

amedyukhina commented 3 years ago

Yes, this was from the server machine. I am running the server with docker-compose.

Here is the output of sudo docker ps

CONTAINER ID   IMAGE                                                 COMMAND                  CREATED      STATUS                          PORTS                                                            NAMES
1e11da6ad4c5   allegroai/clearml-agent-services:latest               "/usr/agent/entrypoi…"   3 days ago   Up 14 seconds                                                                                    clearml-agent-services
ade29d3480cb   allegroai/clearml:latest                              "/opt/trains/wrapper…"   3 days ago   Up 3 days                       8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp, :::8080->80/tcp   clearml-webserver
a28647d69e5f   allegroai/clearml:latest                              "/opt/trains/wrapper…"   3 days ago   Restarting (1) 19 seconds ago                                                                    clearml-apiserver
34f22478e3bb   docker.elastic.co/elasticsearch/elasticsearch:7.6.2   "/usr/local/bin/dock…"   3 days ago   Up 3 days                       9200/tcp, 9300/tcp                                               clearml-elastic
059e88709844   redis:5.0                                             "docker-entrypoint.s…"   3 days ago   Up 3 days                       6379/tcp                                                         clearml-redis
ff3fbd91dfdf   mongo:3.6.5                                           "docker-entrypoint.s…"   3 days ago   Up 3 days                       27017/tcp                                                        clearml-mongo
78870437088f   allegroai/clearml:latest                              "/opt/trains/wrapper…"   3 days ago   Up 3 days                       8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp    clearml-fileserver

jkhenning commented 3 years ago

So it seems your clearml-apiserver container keeps restarting. Did you use any special configuration or make any changes to the docker-compose file? Can you include the output of sudo docker logs clearml-apiserver?
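
Spotting a restart loop in a wide `docker ps` table is easy to miss. As a small sketch, assuming the listing is produced with `sudo docker ps -a --format '{{.Names}}|{{.Status}}'` (one `NAME|STATUS` pair per line), the Python function below returns the containers whose status begins with "Restarting". The function name and the `|` delimiter are choices made for this example.

```python
def restarting_containers(ps_output):
    """Return names of containers whose status starts with 'Restarting'.

    Expects one container per line, formatted as 'NAME|STATUS'.
    """
    names = []
    for line in ps_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        name, status = line.split("|", 1)
        if status.strip().startswith("Restarting"):
            names.append(name.strip())
    return names
```

Applied to the listing above, this would single out clearml-apiserver, whose logs are the next thing to inspect.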

amedyukhina commented 3 years ago

I have followed these instructions to install the clearML server.

I am getting a "Connection refused" error as a response to sudo docker logs clearml-apiserver

Here is the full output:

Loading config from /opt/trains/apiserver/config/default
Loading config from file /opt/trains/apiserver/config/default/apiserver.conf
Loading config from file /opt/trains/apiserver/config/default/hosts.conf
Loading config from file /opt/trains/apiserver/config/default/logging.conf
Loading config from file /opt/trains/apiserver/config/default/secure.conf
Loading config from file /opt/trains/apiserver/config/default/services/auth.conf
Loading config from file /opt/trains/apiserver/config/default/services/events.conf
Loading config from file /opt/trains/apiserver/config/default/services/organization.conf
Loading config from file /opt/trains/apiserver/config/default/services/projects.conf
Loading config from file /opt/trains/apiserver/config/default/services/tasks.conf
Loading config from /opt/trains/config
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 559, in connect
    sock = self._connect()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 615, in _connect
    raise err
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 603, in _connect
    sock.connect(socket_address)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/trains/apiserver/server.py", line 6, in <module>
    from apiserver.server_init.app_sequence import AppSequence
  File "/opt/trains/apiserver/server_init/app_sequence.py", line 10, in <module>
    from apiserver.bll.statistics.stats_reporter import StatisticsReporter
  File "/opt/trains/apiserver/bll/statistics/stats_reporter.py", line 30, in <module>
    worker_bll = WorkerBLL()
  File "/opt/trains/apiserver/bll/workers/__init__.py", line 38, in __init__
    self.redis = redis or redman.connection("workers")
  File "/opt/trains/apiserver/redis_manager.py", line 176, in connection
    obj.get("health")
  File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 1606, in get
    return self.execute_command('GET', name)
  File "/usr/local/lib/python3.6/site-packages/redis/client.py", line 898, in execute_command
    conn = self.connection or pool.get_connection(command_name, **options)
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 1192, in get_connection
    connection.connect()
  File "/usr/local/lib/python3.6/site-packages/redis/connection.py", line 563, in connect
    raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.

jkhenning commented 3 years ago

@amedyukhina this is very strange - can you share your docker-compose.yml file?
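
What makes the log strange is its last line: the API server tries Redis at 127.0.0.1:6379, while the docker-compose.yml shared in this thread sets CLEARML_REDIS_SERVICE_HOST to redis. The Python sketch below is only an illustration of that comparison. It pulls the address out of the final redis error in a log and checks it against the expected host and port. Both function names are hypothetical, and the regular expression is fitted to the error message shown above.

```python
import re

# Fitted to messages like:
#   redis.exceptions.ConnectionError: Error 111 connecting to 127.0.0.1:6379. Connection refused.
REDIS_ERROR = re.compile(r"Error \d+ connecting to (?P<host>[^\s:]+):(?P<port>\d+)")


def redis_target_from_log(log_text):
    """Return (host, port) from the last redis connection error in the log, or None."""
    matches = list(REDIS_ERROR.finditer(log_text))
    if not matches:
        return None
    last = matches[-1]
    return last.group("host"), int(last.group("port"))


def matches_compose(target, expected_host, expected_port):
    """True when the address seen in the log equals the configured one."""
    return target == (expected_host, expected_port)
```

A mismatch such as `("127.0.0.1", 6379)` against `("redis", 6379)` suggests the container is not picking up the environment variables that the compose file defines.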

amedyukhina commented 3 years ago

Here it is:

cat /opt/clearml/docker-compose.yml

version: "3.6"
services:

  apiserver:
    command:
    - apiserver
    container_name: clearml-apiserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/config:/opt/clearml/config
    - /opt/clearml/data/fileserver:/mnt/fileserver
    depends_on:
      - redis
      - mongo
      - elasticsearch
      - fileserver
    environment:
      CLEARML_ELASTIC_SERVICE_HOST: elasticsearch
      CLEARML_ELASTIC_SERVICE_PORT: 9200
      CLEARML_MONGODB_SERVICE_HOST: mongo
      CLEARML_MONGODB_SERVICE_PORT: 27017
      CLEARML_REDIS_SERVICE_HOST: redis
      CLEARML_REDIS_SERVICE_PORT: 6379
      CLEARML_SERVER_DEPLOYMENT_TYPE: ${CLEARML_SERVER_DEPLOYMENT_TYPE:-linux}
      CLEARML__apiserver__pre_populate__enabled: "true"
      CLEARML__apiserver__pre_populate__zip_files: "/opt/clearml/db-pre-populate"
      CLEARML__apiserver__pre_populate__artifacts_path: "/mnt/fileserver"
    ports:
    - "8008:8008"
    networks:
      - backend
      - frontend

  elasticsearch:
    networks:
      - backend
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      cluster.routing.allocation.disk.watermark.high: 500mb
      cluster.routing.allocation.disk.watermark.flood_stage: 500mb
      discovery.zen.minimum_master_nodes: "1"
      discovery.type: "single-node"
      http.compression_level: "7"
      node.ingest: "true"
      node.name: clearml
      reindex.remote.whitelist: '*.*'
      xpack.monitoring.enabled: "false"
      xpack.security.enabled: "false"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
    restart: unless-stopped
    volumes:
      - /opt/clearml/data/elastic_7:/usr/share/elasticsearch/data
      - /usr/share/elasticsearch/logs

  fileserver:
    networks:
      - backend
      - frontend
    command:
    - fileserver
    container_name: clearml-fileserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
    ports:
    - "8081:8081"

  mongo:
    networks:
      - backend
    container_name: clearml-mongo
    image: mongo:3.6.5
    restart: unless-stopped
    command: --setParameter internalQueryExecMaxBlockingSortBytes=196100200
    volumes:
    - /opt/clearml/data/mongo/db:/data/db
    - /opt/clearml/data/mongo/configdb:/data/configdb

  redis:
    networks:
      - backend
    container_name: clearml-redis
    image: redis:5.0
    restart: unless-stopped
    volumes:
    - /opt/clearml/data/redis:/data

  webserver:
    command:
    - webserver
    container_name: clearml-webserver
    image: allegroai/clearml:latest
    restart: unless-stopped
    depends_on:
      - apiserver
    ports:
    - "8080:80"
    networks:
      - backend
      - frontend

  agent-services:
    networks:
      - backend
    container_name: clearml-agent-services
    image: allegroai/clearml-agent-services:latest
    restart: unless-stopped
    privileged: true
    environment:
      CLEARML_HOST_IP: ${CLEARML_HOST_IP}
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
      CLEARML_API_ACCESS_KEY: ${CLEARML_API_ACCESS_KEY:-}
      CLEARML_API_SECRET_KEY: ${CLEARML_API_SECRET_KEY:-}
      CLEARML_AGENT_GIT_USER: ${CLEARML_AGENT_GIT_USER}
      CLEARML_AGENT_GIT_PASS: ${CLEARML_AGENT_GIT_PASS}
      CLEARML_AGENT_UPDATE_VERSION: ${CLEARML_AGENT_UPDATE_VERSION:->=0.17.0}
      CLEARML_AGENT_DEFAULT_BASE_DOCKER: "ubuntu:18.04"
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}
      AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-}
      AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}
      GOOGLE_APPLICATION_CREDENTIALS: ${GOOGLE_APPLICATION_CREDENTIALS:-}
      CLEARML_WORKER_ID: "clearml-services"
      CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/clearml/agent:/root/.clearml
    depends_on:
      - apiserver

networks:
  backend:
    driver: bridge
  frontend:
    driver: bridge

jkhenning commented 3 years ago

Well, it seems like you are using the latest docker-compose.yml, but I think your docker images are from older versions (i.e. 0.17.0 and below).

The best thing to try is to pull the new docker images and start the server up again - try doing:

sudo docker-compose -f docker-compose.yml down
sudo docker-compose -f docker-compose.yml pull
sudo docker-compose -f docker-compose.yml up -d
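
One hint in this thread that the images were stale is the COMMAND column of the earlier `docker ps` output, which shows an entrypoint under /opt/trains (ClearML's former name), while the compose file mounts /opt/clearml paths. The Python sketch below turns that observation into a rough heuristic. It is an inference from the listings in this thread, not a documented ClearML check, and the function names are hypothetical. It assumes input lines produced with `sudo docker ps --format '{{.Names}}|{{.Command}}'`.

```python
def looks_like_old_image(command):
    """Heuristic: an entrypoint under /opt/trains suggests a pre-rename image."""
    return command.strip().strip('"').startswith("/opt/trains/")


def containers_with_old_images(ps_output):
    """Return container names whose command matches the heuristic.

    Expects one container per line, formatted as 'NAME|COMMAND'.
    """
    names = []
    for line in ps_output.splitlines():
        if "|" not in line:
            continue  # skip blank or unexpected lines
        name, command = line.split("|", 1)
        if looks_like_old_image(command):
            names.append(name.strip())
    return names
```

If any names come back, pulling fresh images as described above is the likely fix.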

amedyukhina commented 3 years ago

It is working now. Thank you so much!

LightManxx commented 3 years ago

Hi @jkhenning, I have the same problem and my docker-compose.yml file is exactly the same as @amedyukhina's, but this solution is not working for me.

jkhenning commented 3 years ago

@LightManxx what errors are you getting?

zylprivate commented 2 years ago

@jkhenning I am getting a "connection refused" error on port 8008. This is the docker ps output:

CONTAINER ID        IMAGE                                                  COMMAND                  CREATED             STATUS                          PORTS                                           NAMES
d3470f05547a        allegroai/clearml:latest                               "/opt/clearml/wrap..."   About an hour ago   Up About an hour                8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp   clearml-webserver
db3bc1c8fbe3        allegroai/clearml:latest                               "/opt/clearml/wrap..."   About an hour ago   Up 52 seconds                   0.0.0.0:8008->8008/tcp, 8080-8081/tcp           clearml-apiserver
4e3312726ebe        docker.elastic.co/elasticsearch/elasticsearch:7.16.2   "/bin/tini -- /usr..."   About an hour ago   Restarting (1) 21 minutes ago                                                   clearml-elastic
5611d479324a        redis:5.0                                              "docker-entrypoint..."   About an hour ago   Up About an hour                6379/tcp                                        clearml-redis
a08e39fe7972        mongo:4.4.9                                            "docker-entrypoint..."   About an hour ago   Up About an hour                27017/tcp                                       clearml-mongo
657ba11ee759        allegroai/clearml:latest                               "/opt/clearml/wrap..."   About an hour ago   Up About an hour                8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp      clearml-fileserver

What should I do? Thank you very much! The problem turned out to be insufficient memory. I suggest that the ClearML deployment page state the required RAM and other hardware requirements.

ainoam commented 2 years ago

@zylprivate Which deployment page were you following?

zylprivate commented 2 years ago

https://clear.ml/docs/latest/docs/deploying_clearml/clearml_server_linux_mac/ This page doesn't state how much RAM the machine needs, which I think should be added.

jkhenning commented 2 years ago

@zylprivate the docker status indicates the elastic service is restarting, so there's obviously something wrong - can you do sudo docker logs clearml-elastic and share the result?

zylprivate commented 2 years ago

I had already found the reason for this problem: the machine does not have enough memory (2GB RAM). So I suggest adding hardware requirements to the deployment page. Thanks for your reply.
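
To see why 2GB is too little, compare it with the Elasticsearch heap setting. The compose file posted earlier in this thread sets ES_JAVA_OPTS to -Xms2g -Xmx2g, so the heap alone would take all of the machine's RAM. The Python sketch below assumes this deployment uses the same setting, which the thread does not confirm. The helper names are hypothetical, and the 2x headroom factor is a rule-of-thumb assumption rather than a ClearML requirement.

```python
import re

_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def max_heap_bytes(java_opts):
    """Return the -Xmx value in bytes from a JVM options string, or None."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])?", java_opts)
    if not match:
        return None
    value = int(match.group(1))
    unit = (match.group(2) or "").lower()
    return value * _UNITS.get(unit, 1)


def heap_fits(total_ram_bytes, java_opts, headroom=2.0):
    """True if total RAM is at least `headroom` times the configured max heap."""
    heap = max_heap_bytes(java_opts)
    if heap is None:
        return True  # no explicit heap limit to compare against
    return total_ram_bytes >= heap * headroom
```

On Linux, the total RAM can usually be read with `os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")`. With 2GB of RAM and a 2GB heap, `heap_fits` returns False.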

jkhenning commented 2 years ago

Thanks @zylprivate, will do!

PaulZhangIsing commented 2 years ago

Sometimes clearing the web browser's cache may help.