allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Update from 1.14.1 to 1.15.0 leads to several fatal issues when booting #242

Closed H4dr1en closed 2 months ago

H4dr1en commented 2 months ago

clearml-server was running fine at version 1.14.1 with Docker 19.03.13 and docker-compose 1.27.4.

After upgrading clearml-server to 1.15.0 (using docker-compose pull), I get the following errors:

Redis

I was able to fix this one by specifying the version 6.2.11 instead of 6.2.

clearml-redis   | 1:M 16 Apr 2024 13:27:26.933 # Fatal: Can't initialize Background Jobs.
clearml-redis   | 1:C 16 Apr 2024 13:27:29.937 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
clearml-redis   | 1:C 16 Apr 2024 13:27:29.946 # Redis version=6.2.14, bits=64, commit=00000000, modified=0, pid=1, just started
clearml-redis   | 1:C 16 Apr 2024 13:27:29.946 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
clearml-redis   | 1:M 16 Apr 2024 13:27:29.947 * monotonic clock: POSIX clock_gettime
clearml-redis   | 1:M 16 Apr 2024 13:27:29.955 * Running mode=standalone, port=6379.
clearml-redis   | 1:M 16 Apr 2024 13:27:29.955 # Server initialized
clearml-redis   | 1:M 16 Apr 2024 13:27:29.955 # WARNING Memory overcommit must be enabled! Without it, a background save or replication may fail under low memory condition. Being disabled, it can can also cause failures without low memory condition, see https://github.com/jemalloc/jemalloc/issues/1328. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf andthen reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
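
The startup line in the log above reports Redis 6.2.14, so the floating 6.2 tag resolved to a newer patch release than the 6.2.11 mentioned as a workaround. A minimal sketch for pulling the reported version out of such a log line, for example to compare before and after an upgrade (the helper name `redis_version_from_log` is hypothetical, not part of ClearML or Redis):

```python
import re

# Sample line copied from the Redis log above.
LOG_LINE = (
    "clearml-redis   | 1:C 16 Apr 2024 13:27:29.946 # Redis version=6.2.14, "
    "bits=64, commit=00000000, modified=0, pid=1, just started"
)


def redis_version_from_log(line):
    """Return the version from a Redis startup log line, or None if absent."""
    match = re.search(r"Redis version=(\d+\.\d+\.\d+)", line)
    return match.group(1) if match else None


print(redis_version_from_log(LOG_LINE))  # prints 6.2.14
```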

File server

clearml-fileserver | Loading config from /opt/clearml/fileserver/config/default
clearml-fileserver | Loading config from file /opt/clearml/fileserver/config/default/fileserver.conf
clearml-fileserver | Loading config from file /opt/clearml/fileserver/config/default/logging.conf
clearml-fileserver | Loading config from /opt/clearml/config
clearml-fileserver | Loading config from file /opt/clearml/config/apiserver.conf
clearml-fileserver |  * Serving Flask app 'fileserver'
clearml-fileserver |  * Debug mode: off
clearml-fileserver | ----------------------------------------
clearml-fileserver | Exception occurred during processing of request from ('10.x.x.x', 10116)
clearml-fileserver | Traceback (most recent call last):
clearml-fileserver |   File "/usr/local/lib/python3.9/socketserver.py", line 316, in _handle_request_noblock
clearml-fileserver |     self.process_request(request, client_address)
clearml-fileserver |   File "/usr/local/lib/python3.9/socketserver.py", line 697, in process_request
clearml-fileserver |     t.start()
clearml-fileserver |   File "/usr/local/lib/python3.9/threading.py", line 899, in start
clearml-fileserver |     _start_new_thread(self._bootstrap, ())
clearml-fileserver | RuntimeError: can't start new thread
clearml-fileserver | ----------------------------------------

Async_delete

async_delete    | OpenBLAS blas_thread_init: pthread_create failed for thread 1 of 2: Operation not permitted
async_delete    | OpenBLAS blas_thread_init: RLIMIT_NPROC -1 current, -1 max
async_delete    | Traceback (most recent call last):
async_delete    |   File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
async_delete    |     return _run_code(code, main_globals, None,
async_delete    |   File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
async_delete    |     exec(code, run_globals)
async_delete    |   File "/opt/clearml/apiserver/jobs/async_urls_delete.py", line 22, in <module>
async_delete    |     from apiserver.bll.storage import StorageBLL
async_delete    |   File "/opt/clearml/apiserver/bll/storage/__init__.py", line 4, in <module>
async_delete    |     from clearml.backend_config.bucket_config import (
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/__init__.py", line 5, in <module>
async_delete    |     from .task import Task
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 45, in <module>
async_delete    |     from .backend_interface.metrics import Metrics
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/__init__.py", line 2, in <module>
async_delete    |     from .task import Task
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/task/__init__.py", line 1, in <module>
async_delete    |     from .task import Task
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/task/task.py", line 31, in <module>
async_delete    |     from ...binding.artifacts import Artifacts
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/binding/artifacts.py", line 25, in <module>
async_delete    |     from ..backend_interface.metrics.events import UploadEvent
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/metrics/__init__.py", line 2, in <module>
async_delete    |     from .interface import Metrics
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/metrics/interface.py", line 17, in <module>
async_delete    |     from .events import MetricsEventAdapter
async_delete    |   File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/metrics/events.py", line 8, in <module>
async_delete    |     import numpy as np
async_delete    |   File "/usr/local/lib/python3.9/site-packages/numpy/__init__.py", line 130, in <module>
async_delete    |     from numpy.__config__ import show as show_config
async_delete    |   File "/usr/local/lib/python3.9/site-packages/numpy/__config__.py", line 4, in <module>
async_delete    |     from numpy.core._multiarray_umath import (
async_delete    |   File "/usr/local/lib/python3.9/site-packages/numpy/core/__init__.py", line 24, in <module>
async_delete    |     from . import multiarray
async_delete    |   File "/usr/local/lib/python3.9/site-packages/numpy/core/multiarray.py", line 10, in <module>
async_delete    |     from . import overrides
async_delete    |   File "/usr/local/lib/python3.9/site-packages/numpy/core/overrides.py", line 8, in <module>
async_delete    |     from numpy.core._multiarray_umath import (
async_delete    |   File "<frozen importlib._bootstrap>", line 203, in _lock_unlock_module
async_delete    | KeyboardInterrupt

API server

clearml-apiserver | [2024-04-16 14:47:28,963] [8] [INFO] [clearml.redis_manager] Using override redis host redis
clearml-apiserver | [2024-04-16 14:47:28,963] [8] [INFO] [clearml.redis_manager] Using override redis port 6379
clearml-apiserver | [2024-04-16 14:47:28,985] [8] [INFO] [clearml.es_factory] Using override elastic host ip-10-x-x-x.y-y-y.compute.internal
clearml-apiserver | [2024-04-16 14:47:28,985] [8] [INFO] [clearml.es_factory] Using override elastic port 9200
clearml-apiserver | [2024-04-16 14:47:29,094] [8] [INFO] [clearml.schema_reader] loading schema from cache
clearml-apiserver | [2024-04-16 14:47:29,430] [8] [INFO] [clearml.app_sequence] ################ API Server initializing #####################
clearml-apiserver | [2024-04-16 14:47:29,432] [8] [INFO] [clearml.database] Initializing database connections
clearml-apiserver | [2024-04-16 14:47:29,433] [8] [INFO] [clearml.database] Using override mongodb host mongo
clearml-apiserver | [2024-04-16 14:47:29,433] [8] [INFO] [clearml.database] Using override mongodb port 27017
clearml-apiserver | [2024-04-16 14:47:29,436] [8] [INFO] [clearml.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
clearml-apiserver | [2024-04-16 14:47:29,439] [8] [INFO] [clearml.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
clearml-apiserver | Traceback (most recent call last):
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/mongoengine/connection.py", line 348, in _create_connection
clearml-apiserver |     return mongo_client_class(**connection_settings)
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 837, in __init__
clearml-apiserver |     self._get_topology()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/mongo_client.py", line 1214, in _get_topology
clearml-apiserver |     self._topology.open()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 192, in open
clearml-apiserver |     self._ensure_opened()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 596, in _ensure_opened
clearml-apiserver |     self._update_servers()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/topology.py", line 747, in _update_servers
clearml-apiserver |     server.open()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/server.py", line 49, in open
clearml-apiserver |     self._monitor.open()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/monitor.py", line 79, in open
clearml-apiserver |     self._executor.open()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/pymongo/periodic_executor.py", line 87, in open
clearml-apiserver |     thread.start()
clearml-apiserver |   File "/usr/local/lib/python3.9/threading.py", line 899, in start
clearml-apiserver |     _start_new_thread(self._bootstrap, ())
clearml-apiserver | RuntimeError: can't start new thread
clearml-apiserver | 
clearml-apiserver | During handling of the above exception, another exception occurred:
clearml-apiserver | 
clearml-apiserver | Traceback (most recent call last):
clearml-apiserver |   File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
clearml-apiserver |     return _run_code(code, main_globals, None,
clearml-apiserver |   File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
clearml-apiserver |     exec(code, run_globals)
clearml-apiserver |   File "/opt/clearml/apiserver/server.py", line 10, in <module>
clearml-apiserver |     AppSequence(app).start(request_handlers=RequestHandlers())
clearml-apiserver |   File "/opt/clearml/apiserver/server_init/app_sequence.py", line 42, in start
clearml-apiserver |     self._init_dbs()
clearml-apiserver |   File "/opt/clearml/apiserver/server_init/app_sequence.py", line 89, in _init_dbs
clearml-apiserver |     empty_db = check_mongo_empty()
clearml-apiserver |   File "/opt/clearml/apiserver/mongo/initialize/migration.py", line 22, in check_mongo_empty
clearml-apiserver |     collection_names = get_db(alias).list_collection_names()
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/mongoengine/connection.py", line 386, in get_db
clearml-apiserver |     conn = get_connection(alias)
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/mongoengine/connection.py", line 335, in get_connection
clearml-apiserver |     connection = _create_connection(
clearml-apiserver |   File "/usr/local/lib/python3.9/site-packages/mongoengine/connection.py", line 350, in _create_connection
clearml-apiserver |     raise ConnectionFailure(f"Cannot connect to database {alias} :\n{e}")
clearml-apiserver | mongoengine.connection.ConnectionFailure: Cannot connect to database auth-db :
clearml-apiserver | can't start new thread
clearml-apiserver | Loading config from /opt/clearml/apiserver/config/default
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/apiserver.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/secure.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/hosts.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/logging.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/auth.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/queues.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/storage_credentials.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/_mongo.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/tasks.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/async_urls_delete.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/projects.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/models.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/events.conf
clearml-apiserver | Loading config from file /opt/clearml/apiserver/config/default/services/organization.conf
clearml-apiserver | Loading config from /opt/clearml/config
clearml-apiserver | Loading config from file /opt/clearml/config/apiserver.conf
clearml-apiserver exited with code 1

More info about the instance: it's an AWS t3.small instance.

$ uname -a
Linux ip-x-y-z 5.4.0-1103-aws #111~18.04.1-Ubuntu SMP Tue May 23 20:04:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 7668
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1048576
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7668
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I didn't make any changes to docker-compose.yml.
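
The ulimit output above comes from the host shell, and the limits seen by a process inside a container can differ (the async_delete log prints RLIMIT_NPROC as -1, which probably means unlimited). A rough sketch, assuming a Unix-like system and Python's standard `resource` module, for printing the limits from inside whichever process or container is failing (the helper name `describe_limit` is hypothetical):

```python
import resource


def describe_limit(name):
    """Format the soft and hard values of a resource limit such as RLIMIT_NPROC."""
    soft, hard = resource.getrlimit(getattr(resource, name))

    def fmt(value):
        # RLIM_INFINITY is how the OS reports "no limit".
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{name}: soft={fmt(soft)} hard={fmt(hard)}"


# RLIMIT_NPROC caps processes/threads per user; RLIMIT_NOFILE caps open files.
for limit_name in ("RLIMIT_NPROC", "RLIMIT_NOFILE"):
    print(describe_limit(limit_name))
```

If the limits look generous but thread creation still fails, the cause is likely somewhere other than ulimit settings.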

jkhenning commented 2 months ago

Hi @H4dr1en, this seems really strange and the same sort of issue is apparent in most containers - is it possible there's some threading issue in the system which causes our container to fail spawning new threads?
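
One way to test this hypothesis is to try starting a single thread with the same Python interpreter the failing service uses. This is only a sketch (the function name `can_start_thread` is hypothetical); it catches the same `RuntimeError` that appears in the fileserver and apiserver tracebacks above:

```python
import threading


def can_start_thread():
    """Try to start one short-lived thread; return (ok, error_message)."""
    try:
        worker = threading.Thread(target=lambda: None)
        worker.start()
        worker.join()
        return True, ""
    except RuntimeError as exc:
        # CPython raises RuntimeError("can't start new thread") when the
        # underlying thread creation call fails.
        return False, str(exc)


ok, message = can_start_thread()
print("thread creation works" if ok else f"thread creation failed: {message}")
```

Running it both on the host and inside an affected container would show whether the failure is specific to the container environment.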

evg-allegro commented 2 months ago

Hi @H4dr1en, can you please check the Docker engine version? For the new apiserver images, the Docker engine should be 20.10.10 or later.
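
A small sketch of that check, comparing an engine version string against the 20.10.10 minimum stated above. The helper names are hypothetical; `installed_engine_version` assumes the `docker` CLI is on the PATH and uses `docker version --format '{{.Server.Version}}'` to read the engine version:

```python
import subprocess

MINIMUM_ENGINE = "20.10.10"  # minimum stated in the comment above


def parse_version(text):
    """Turn '20.10.10' into (20, 10, 10), ignoring suffixes like '-ce'."""
    core = text.strip().split("-")[0].split("+")[0]
    return tuple(int(part) for part in core.split("."))


def meets_minimum(version, minimum=MINIMUM_ENGINE):
    """Return True if version is at least the required minimum."""
    return parse_version(version) >= parse_version(minimum)


def installed_engine_version():
    """Ask the local Docker CLI for the server (engine) version."""
    result = subprocess.run(
        ["docker", "version", "--format", "{{.Server.Version}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# The version reported at the top of this issue:
print(meets_minimum("19.03.13"))  # prints False
```

Comparing numeric tuples rather than raw strings avoids mistakes with zero-padded components such as "03".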

H4dr1en commented 2 months ago

Hi both, updating docker fixed it, thanks 👍