allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Elasticsearch config seems non functional #1072

Open chriswue opened 1 year ago

chriswue commented 1 year ago

Describe the bug

I tried to connect ClearML to a standalone Elasticsearch instance, but the apiserver fails to connect to it.

To reproduce

Run the clearml container with the apiserver command and configure ES via the environment variables CLEARML_ELASTIC_SERVICE_HOST, CLEARML_ELASTIC_SERVICE_PORT, CLEARML_ELASTIC_SERVICE_USERNAME, and CLEARML_ELASTIC_SERVICE_PASSWORD.

The result is always:

[2023-07-12 22:52:20,573] [9] [ERROR] [clearml.app_sequence] Error connecting to Elasticsearch: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7f9d77582640>: Failed to establish a new connection: [Errno -2] Name or service not known) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7f9d77582640>: Failed to establish a new connection: [Errno -2] Name or service not known)
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/clearml/apiserver/server.py", line 10, in <module>
    AppSequence(app).start(request_handlers=RequestHandlers())
  File "/opt/clearml/apiserver/server_init/app_sequence.py", line 42, in start
    self._init_dbs()
  File "/opt/clearml/apiserver/server_init/app_sequence.py", line 101, in _init_dbs
    raise Exception(
Exception: Error starting server: failed connecting to ElasticSearch service

To debug this, I started the clearml container with a shell and ran python3. The following works (below is the Python REPL output):

>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch(hosts=['https://<myinstance>.azure.elastic-cloud.com:443'],http_auth=('elastic','<password>'))
>>> es.info()
{'name': 'instance-0000000000', 'cluster_name': '<redacted>', 'cluster_uuid': '<redacted>', 'version': {'number': '8.8.2', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '98e1271edf932a480e4262a471281f1ee295ce6b', 'build_date': '2023-06-26T05:16:16.196344851Z', 'build_snapshot': False, 'lucene_version': '9.6.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}

However, after looking at how ES is configured in ClearML, it seems to pass an array of dictionaries rather than an array of strings, and this fails:

>>> es = Elasticsearch(hosts=[{"host":'https://<myinstance>.azure.elastic-cloud.com', "port": 443}], http_auth=('elastic','<password>'))
>>> es.info()
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 159, in _new_conn
    conn = connection.create_connection(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 61, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 386, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 670, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 171, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fe9e904e3a0>: Failed to establish a new connection: [Errno -2] Name or service not known
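
One possible reading of the failure above, offered as an assumption rather than a confirmed diagnosis: the 7.x Python client appears to parse URL strings in hosts into their components, while a dict entry seems to be passed through unchanged, so a host value that still carries the https:// prefix would be handed to DNS as if it were a hostname. The sketch below uses only the standard library to split a URL into the dict form. url_to_host_entry is a hypothetical helper (not part of ClearML or elasticsearch-py), and the use_ssl key is assumed to be the one the 7.x client understands in a host dict.

```python
from urllib.parse import urlparse


def url_to_host_entry(url):
    """Split a URL into a host dict.

    Hypothetical helper for illustration; not part of ClearML or elasticsearch-py.
    """
    parsed = urlparse(url)
    if not parsed.hostname:
        raise ValueError(f"no hostname found in {url!r}")
    use_ssl = parsed.scheme == "https"
    entry = {
        "host": parsed.hostname,  # bare hostname, without the scheme
        "port": parsed.port or (443 if use_ssl else 9200),
    }
    if use_ssl:
        # Assumption: the 7.x client reads this key from a host dict.
        entry["use_ssl"] = True
    return entry


entry = url_to_host_entry("https://myinstance.azure.elastic-cloud.com:443")
print(entry)
```

If that assumption holds, passing hosts=[entry] would be the dict equivalent of the URL string that worked in the REPL session.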

Setting the hosts config to ['myhost:port'] instead results in the following error:

[2023-07-12 23:18:21,662] [9] [INFO] [clearml.schema_reader] regenerating schema cache
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/clearml/apiserver/server.py", line 6, in <module>
    from apiserver.server_init.app_sequence import AppSequence
  File "/opt/clearml/apiserver/server_init/app_sequence.py", line 20, in <module>
    from apiserver.mongo.initialize import (
  File "/opt/clearml/apiserver/mongo/initialize/__init__.py", line 9, in <module>
    from .pre_populate import PrePopulate
  File "/opt/clearml/apiserver/mongo/initialize/pre_populate.py", line 62, in <module>
    class PrePopulate:
  File "/opt/clearml/apiserver/mongo/initialize/pre_populate.py", line 64, in PrePopulate
    event_bll = EventBLL()
  File "/opt/clearml/apiserver/bll/event/event_bll.py", line 85, in __init__
    self.es = events_es or es_factory.connect("events")
  File "/opt/clearml/apiserver/es_factory.py", line 80, in connect
    cluster_config = cls.get_cluster_config(cluster_name)
  File "/opt/clearml/apiserver/es_factory.py", line 143, in get_cluster_config
    set_host_prop("host", host)
  File "/opt/clearml/apiserver/es_factory.py", line 138, in set_host_prop
    entry[key] = value
TypeError: 'str' object does not support item assignment
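
The TypeError above is consistent with code that assigns a key on every hosts entry, which only works when each entry is a dict. The sketch below reproduces that pattern in isolation. apply_host_override is a hypothetical stand-in written for illustration; it is a guess at the shape of the logic behind set_host_prop, not ClearML's actual code, and the host names are made up.

```python
def apply_host_override(hosts, key, value):
    """Set `key` on every hosts entry.

    Hypothetical stand-in for the override logic; not ClearML's actual code.
    """
    for entry in hosts:
        entry[key] = value  # fine for dict entries, raises TypeError for str entries
    return hosts


# Dict entries accept the override.
print(apply_host_override([{"host": "localhost", "port": 9200}], "host", "es.example.internal"))

# String entries cannot be assigned into, which matches the traceback above.
try:
    apply_host_override(["localhost:9200"], "host", "es.example.internal")
except TypeError as exc:
    print(f"TypeError: {exc}")
```

Under this reading, the hosts list in the config file has to stay in dict form whenever the environment variable overrides are in use.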

The ES Python SDK documentation doesn't reveal much, aside from the fact that none of the examples I've seen anywhere pass a structure like [{"host": "<host>", "port": 123}]. All examples only ever pass lists of strings like ['host1:port', 'host2:port']. So I don't know where the problem really is, but the current implementation doesn't work.

Expected behaviour

I should be able to connect to a non-standard (i.e. not localhost) ES instance.

Environment

chriswue commented 1 year ago

So, since I'm running an Azure cloud instance, I managed to get it to work by creating a hosts.conf with the following content:

elastic {
    events {
        hosts: [{host: 'localhost', port: 9200}]
        args {
            cloud_id: '<cloud id of deployment>'
        }
    }
    workers {
        hosts: [{host: 'localhost', port: 9200}]
        args {
            cloud_id: '<cloud id of deployment>'
        }
    }
}

Make sure to escape = as \= in the cloud_id: it contains a base64-encoded string that ends in =, and the config file parser treats = as a special character.

While this solved my immediate problem, it doesn't fix the more general case where an actual host name needs to be configured.
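
The = escaping described in this comment can be sketched as a small helper. escape_for_hosts_conf is a hypothetical name, the cloud ID shown is a made-up placeholder, and the sketch covers only the = character mentioned above; whether other characters also need escaping is not addressed here.

```python
def escape_for_hosts_conf(value):
    """Escape '=' as '\\=' before pasting a value into hosts.conf.

    Hypothetical helper; handles only the '=' case described in the comment.
    """
    return value.replace("=", "\\=")


# Made-up placeholder, not a real deployment's cloud ID.
cloud_id = "my-deployment:ZXhhbXBsZQ=="
print(escape_for_hosts_conf(cloud_id))
```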

jkhenning commented 1 year ago

Hi @chriswue, when using the environment variables, did you see "Using override elastic host..." in the apiserver log?

chriswue commented 1 year ago

Yes, all those messages appeared. I also patched the server init python file to dump out extra information and the correct ES host was being passed in.

jkhenning commented 1 year ago

The error you attached for that case says "Name or service not known", which basically indicates the apiserver could not reach that address...?

chriswue commented 1 year ago

No, what I wrote is that using this constructor doesn't work (which is what ClearML is doing):

es = Elasticsearch(hosts=[{"host":'https://<myinstance>.azure.elastic-cloud.com', "port": 443}])

While this works:

es = Elasticsearch(hosts=['https://<myinstance>.azure.elastic-cloud.com:443'])

Tested within the same ClearML container

jkhenning commented 1 year ago

OK, but since we always do it the same way, are you saying it works for the default values we use but not when you override them?

chriswue commented 1 year ago

Looks like it. Here is the cleanest reproducer I can come up with: [image: reproducer screenshot]

Possibly a bug in the Elastic SDK? 7.13.3 is now more than 2 years old, so it might be time to upgrade to 8.8.