allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

clearml-agent-services container always restarting? #75

Open hadyan-tvlk opened 3 years ago

hadyan-tvlk commented 3 years ago

Dear ClearML community,

I have an issue where, at some point (I'm not sure when), the ClearML Server starts restarting clearml-agent-services continuously and the container never comes up. The rest of the services, however, stay up and running.

[Screenshot 2021-05-07 at 7 07 46 PM]

This makes the ClearML Server unreachable. Any idea what causes it? Is it related to the infra config? Thanks in advance!

jkhenning commented 3 years ago

Hi @hadyan-tvlk,

I assume you're using the docker-compose deployment?

"This event cause the ClearML Server is unreachable"

What exactly do you mean by the ClearML Server being unreachable? Can you access the WebApp at port 8080? If not, what error are you seeing? It is unlikely that this is caused by the agent services as the agent services is simply a client of the server, and cannot affect the server itself or the WebApp.

hadyan-tvlk commented 3 years ago

Hi @jkhenning,

"I assume you're using the docker-compose deployment?"

Yes, correct.

"What exactly do you mean by the ClearML Server being unreachable? Can you access the WebApp at port 8080? If not, what error are you seeing?"

I just can't open the Web UI or perform tracking. The workaround is to restart the server, after which it returns to normal.

jkhenning commented 3 years ago

The next time it happens, can you do sudo docker ps on the server machine and share the output?

Also, it would be nice to see the output of your browser's Developer Tools' Network section when trying to access the Web UI (when you fail to open it).

ecm200 commented 3 years ago

I am having the same issue, with the agent-services container continually restarting.

I have installed clearml server on an Azure VM, running on Ubuntu 18.04. It's a completely fresh machine and I can confirm that I have opened ports 8080, 8081 and 8008 on the VM.

The only change I made to the basic installation guide was to secure the web interface by creating the apiserver.conf file in /opt/clearml/config with the following content:

auth {
    # Fixed users login credentials
    # No other user will be able to login
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "***********"
                password: "***********"
                name: "Ed Morris"
            },
            {
                username: "**************"
                password: "**************"
                name: "Chris Musselle"
            },
        ]
    }
}

Obviously, the actual usernames and passwords have been replaced.

I installed the server using the docker-compose method, following the documentation. I can access the web portal without issue.

Performing a docker ps shows that the clearml-agent-services is always restarting.

CONTAINER ID   IMAGE                                                 COMMAND                  CREATED         STATUS                          PORTS                                                            NAMES
e4d95f57b20f   allegroai/clearml:latest                              "/opt/clearml/wrappe…"   7 minutes ago   Up 7 minutes                    8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp, :::8080->80/tcp   clearml-webserver
30313b878a66   allegroai/clearml-agent-services:latest               "/usr/agent/entrypoi…"   7 minutes ago   Restarting (1) 46 seconds ago                                                                    clearml-agent-services
7e247b87d335   allegroai/clearml:latest                              "/opt/clearml/wrappe…"   7 minutes ago   Up 7 minutes                    0.0.0.0:8008->8008/tcp, :::8008->8008/tcp, 8080-8081/tcp         clearml-apiserver
912c386c705c   docker.elastic.co/elasticsearch/elasticsearch:7.6.2   "/usr/local/bin/dock…"   7 minutes ago   Up 7 minutes                    9200/tcp, 9300/tcp                                               clearml-elastic
6ccf7f03c607   redis:5.0                                             "docker-entrypoint.s…"   7 minutes ago   Up 7 minutes                    6379/tcp                                                         clearml-redis
d1a98ae6cd21   mongo:3.6.5                                           "docker-entrypoint.s…"   7 minutes ago   Up 7 minutes                    27017/tcp                                                        clearml-mongo
42bfc545f7e0   allegroai/clearml:latest                              "/opt/clearml/wrappe…"   7 minutes ago   Up 7 minutes                    8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp    clearml-fileserver
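
For reference, the restart loop visible in the STATUS column above can also be detected from a script. This is a minimal sketch, assuming the container name shown in the NAMES column and access to the Docker daemon; the function names `classify_status` and `agent_services_status` are hypothetical helpers, not part of ClearML or Docker.

```shell
#!/usr/bin/env bash
# Sketch: classify a `docker ps` STATUS string for the agent-services container.

# Hypothetical helper: map a STATUS string to a short verdict.
classify_status() {
    case "$1" in
        Restarting*) echo "restart-loop" ;;
        Up*)         echo "running" ;;
        "")          echo "not-found" ;;
        *)           echo "other" ;;
    esac
}

# Ask Docker for the container's current status. The container name is taken
# from the NAMES column above; -a also lists containers that are not running.
agent_services_status() {
    docker ps -a --filter "name=clearml-agent-services" --format '{{.Status}}'
}

# Usage:
#   classify_status "$(agent_services_status)"
```

A verdict of `restart-loop` corresponds to the "Restarting (1) ..." status above, at which point the container logs are the next place to look.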

Looking at the logs for this container, it is complaining about the credentials not being correct:

(base) edmorris@ecm-clearml-server-001:/opt/clearml/config$ docker logs 30313b878a66
http://13.81.201.17:8081 http://13.81.201.17:8080 http://apiserver:8008
WARNING: You are using pip version 20.3.3; however, version 21.1.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.

clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server http://apiserver:8008 ?

http://13.81.201.17:8081 http://13.81.201.17:8080 http://apiserver:8008
WARNING: You are using pip version 20.3.3; however, version 21.1.1 is available.
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.

clearml_agent: ERROR: Failed getting token (error 401 from http://apiserver:8008): Unauthorized (invalid credentials) (failed to locate provided credentials)

[Identical output repeated 15 more times; omitted.]

jkhenning commented 3 years ago

Hi @ecm200,

Thanks for providing such detailed info 👍

What you're seeing is a result of a change we've made in ClearML Server 1.0.0 and up, which is still lacking in the documentation (we plan to release a revamped version of the documentation in the next few days which will address this as well).

In short, starting from 1.0.0, when running in the fixed-users mode, the ClearML Agent Services that runs as part of the server requires specific credentials to be provided. This is due to the fact that for security reasons, when running in the fixed-users mode the server will not support the hard-coded test credentials used by the agent by default (see here for a related discussion in our Slack channel).

To provide the required credentials to the agent-services, you will need to set the CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY environment variables with appropriate credentials when starting the server using docker-compose - these can be a set of key/secret credentials generated in your ClearML Server's profile page, or simply the username/password of a fixed user you defined. Obviously, you can generate these credentials right now since your server is up and running, even though the agent-services is not booting up. Setting these is done in the same way as described in step #11 in Deploying ClearML Server: Linux and macOS / Deploying.
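
The step described above might look roughly like the following. This is a sketch under assumptions, not the documented procedure: the compose file path `/opt/clearml/docker-compose.yml` is assumed from the default docker-compose installation, the key values are placeholders, and `run` and `restart_with_credentials` are hypothetical helper names. By default the commands are only printed; set `DRY_RUN=0` to execute them.

```shell
#!/usr/bin/env bash
# Sketch: restart the ClearML Server stack with agent-services credentials set.

# Assumed default location of the compose file; adjust if yours differs.
COMPOSE_FILE="${COMPOSE_FILE:-/opt/clearml/docker-compose.yml}"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: refuse to restart unless both variables are non-empty.
restart_with_credentials() {
    if [ -z "${CLEARML_API_ACCESS_KEY:-}" ] || [ -z "${CLEARML_API_SECRET_KEY:-}" ]; then
        echo "CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY must both be set" >&2
        return 1
    fi
    run docker-compose -f "$COMPOSE_FILE" down
    run docker-compose -f "$COMPOSE_FILE" up -d
}

# Usage (values are placeholders):
#   export CLEARML_API_ACCESS_KEY="<access key>"
#   export CLEARML_API_SECRET_KEY="<secret key>"
#   DRY_RUN=0 restart_with_credentials
```

If docker-compose is run through sudo, the exported variables may not reach it unless the environment is preserved, for example with `sudo -E`.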

ecm200 commented 3 years ago

Thanks for the quick feedback @jkhenning.

So, just to be clear: if I go to the profile page in the ClearML Web UI and generate an App Credential, just as I did to connect the ClearML installation on my local laptop to the server on the Azure VM, do I then supply those generated keys by exporting them in the relevant environment variables?

[image]

jkhenning commented 3 years ago

Exactly right 🙂
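
A generated key pair can be sanity-checked against the API server before the stack is restarted. The sketch below assumes the API server is reachable on the host at `http://localhost:8008` (port 8008 is published in the `docker ps` output above) and that the `auth.login` endpoint accepts the key and secret via HTTP basic auth; `interpret_login_status` and `login_status` are hypothetical helper names. A 401 here would match the "Failed getting token (error 401 ...)" message in the agent logs.

```shell
#!/usr/bin/env bash
# Sketch: check a key/secret pair against the ClearML API server.

# Assumed address of the API server when run on the server host itself.
API_HOST="${API_HOST:-http://localhost:8008}"

# Hypothetical helper: translate the HTTP status code of the login call.
interpret_login_status() {
    case "$1" in
        200) echo "credentials accepted" ;;
        401) echo "credentials rejected" ;;
        *)   echo "unexpected response: $1" ;;
    esac
}

# Request a token using HTTP basic auth (access key as user, secret key as
# password) and print only the HTTP status code of the response.
login_status() {
    curl -s -o /dev/null -w '%{http_code}' \
        -u "${CLEARML_API_ACCESS_KEY}:${CLEARML_API_SECRET_KEY}" \
        "${API_HOST}/auth.login"
}

# Usage:
#   interpret_login_status "$(login_status)"
```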

ecm200 commented 3 years ago

@jkhenning Thanks so much.

This has resulted in a stable system.

CONTAINER ID   IMAGE                                                 COMMAND                  CREATED             STATUS             PORTS                                                            NAMES
9ef0d8cfc721   allegroai/clearml-agent-services:latest               "/usr/agent/entrypoi…"   About an hour ago   Up About an hour                                                                    clearml-agent-services
11a5d2041fb1   allegroai/clearml:latest                              "/opt/clearml/wrappe…"   About an hour ago   Up About an hour   8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp, :::8080->80/tcp   clearml-webserver
f8fb56da4c77   allegroai/clearml:latest                              "/opt/clearml/wrappe…"   About an hour ago   Up About an hour   0.0.0.0:8008->8008/tcp, :::8008->8008/tcp, 8080-8081/tcp         clearml-apiserver
fa3285bddd1a   allegroai/clearml:latest                              "/opt/clearml/wrappe…"   About an hour ago   Up About an hour   8008/tcp, 8080/tcp, 0.0.0.0:8081->8081/tcp, :::8081->8081/tcp    clearml-fileserver
f47a9774497f   redis:5.0                                             "docker-entrypoint.s…"   About an hour ago   Up About an hour   6379/tcp                                                         clearml-redis
5aed2c482329   docker.elastic.co/elasticsearch/elasticsearch:7.6.2   "/usr/local/bin/dock…"   About an hour ago   Up About an hour   9200/tcp, 9300/tcp                                               clearml-elastic
7ff1047655f6   mongo:3.6.5                                           "docker-entrypoint.s…"   About an hour ago   Up About an hour   27017/tcp                                                        clearml-mongo

ecm200 commented 3 years ago

@jkhenning

This is in relation to the agent-services service needing secret keys to be set in environment variables.

What is the safest way of doing this on a routine basis?

I mean, whilst testing and learning the deployment, the VM hosting the server will not be up 24 hours a day, so what is the easiest way to set this automatically, without having to set the environment variables and restart the server each time?

Also, an enhancement suggestion: on the profile screen that shows the current access keys, it would be really useful to have a column where people can add their own description to each secret key, to record which service or machine is using it. When you have local machines, compute nodes, and services all requiring secret keys, it quickly becomes impossible to track which key serves what purpose unless separate records are kept. A way for the user to attach a recognizable tag would really help, in my opinion.