aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

Securing Aim Remote Tracking server using SSL key and certificate #3172

Open JeroenVranken opened 1 week ago

JeroenVranken commented 1 week ago


Hi, first of all I appreciate all the work you've put into making Aim!

I am having some trouble securing the connection to the Aim Remote Tracking (RT) Server, and was wondering if you could help me out.

I recently set up a virtual machine on Azure, which runs both the Aim RT Server and the Aim UI. To do this, I use a docker-compose.yml that brings up both the server and the UI. This works properly: I can log runs from another machine and see them appear in the UI.

However, now I want to secure the connection to the remote tracking server using SSL, as described here. I've created a self-signed key and certificate file using openssl, as described here.

Whenever I bring up the server using this command, everything seems to be in working order and I do not get any errors:

aim server --repo ~/mycontainer/aim/ --ssl-keyfile ~/secrets/server.key --ssl-certfile ~/secrets/server.crt --host 0.0.0.0 --dev --port 53800

But then when I try to log a run from another machine, I get the following error on the client:

azureuser@ml-ci-jvranken-prd:~/cloudfiles/code/Users/jvranken/aim-tracking-server$ python aim_test.py 
Failed to connect to Aim Server. Have you forgot to run `aim server` command?
Traceback (most recent call last):
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/anaconda/envs/verhuiskans/lib/python3.10/http/client.py", line 287, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/utils.py", line 14, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/client.py", line 138, in connect
    response = requests.get(endpoint, headers=self.request_headers)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/requests/adapters.py", line 682, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/clusters/ml-ci-jvranken-prd/code/Users/jvranken/aim-tracking-server/aim_test.py", line 7, in <module>
    run = Run(
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/run.py", line 859, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/run.py", line 272, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/base_run.py", line 34, in __init__
    self.repo = get_repo(repo)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/repo_utils.py", line 26, in get_repo
    repo = Repo.from_path(repo)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/repo.py", line 210, in from_path
    repo = Repo(path, read_only=read_only, init=init)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/sdk/repo.py", line 121, in __init__
    self._client = Client(remote_path)
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/client.py", line 50, in __init__
    self.connect()
  File "/anaconda/envs/verhuiskans/lib/python3.10/site-packages/aim/ext/transport/utils.py", line 18, in wrapper
    raise RuntimeError(error_message)
RuntimeError: Failed to connect to Aim Server. Have you forgot to run `aim server` command?
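
To separate server-side TLS problems from client-side ones, a small probe run from the client machine can show whether the port completes a TLS handshake at all. The following is only a sketch using the Python standard library; `probe_tls` and its parameters are hypothetical names introduced here and are not part of Aim. When no `cafile` is given, certificate verification is deliberately switched off so that only the handshake is tested, which is acceptable for a diagnostic but not for real traffic.

```python
import socket
import ssl


def probe_tls(host, port, cafile=None, timeout=5.0):
    """Attempt a TLS handshake with host:port and report the outcome.

    cafile: optional path to a certificate to trust, for example a copy of
    the self-signed server.crt. When omitted, verification is disabled so
    that only the handshake itself is exercised (diagnostics only).
    """
    if cafile is not None:
        # Verify the server certificate against the given file.
        context = ssl.create_default_context(cafile=cafile)
    else:
        # Diagnostic mode: accept any certificate and any hostname.
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE

    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                return {"tls": True, "version": tls.version(), "error": None}
    except ssl.SSLError as exc:
        # The TCP connection worked, but the TLS handshake or verification failed.
        return {"tls": False, "version": None, "error": f"TLS error: {exc}"}
    except OSError as exc:
        # Refused connection, timeout, DNS failure and similar.
        return {"tls": False, "version": None, "error": f"connection error: {exc}"}
```

Calling `probe_tls(<server host>, 53800)` from the client machine and getting `"tls": True` would suggest the server side is serving TLS and that the remaining problem is in how the client connects. A `TLS error` when passing `cafile` but not without it would point at the certificate itself, for example a hostname that does not match.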

Do you have any idea why this is not working? Here are the docker-compose.yml and the Python file I'm using:

services:
  ui:
    image: aimstack/aim:3.20.1
    container_name: aim_ui
    restart: unless-stopped
    command: up --host 0.0.0.0 --port 43800 --dev
    ports:
      - 80:43800
    volumes:
    - ~/mycontainer/aim:/opt/aim
    networks:
      - aim

  server:
    image: aimstack/aim:3.20.1
    container_name: aim_server
    restart: unless-stopped
    command: server --host 0.0.0.0 --dev --ssl-keyfile /opt/secrets/server.key --ssl-certfile /opt/secrets/server.crt
    ports:
      - 53800:53800
    volumes:
    - ~/mycontainer/aim:/opt/aim
    - ~/secrets:/opt/secrets
    networks:
      - aim

networks:
  aim:
    driver: bridge

from aim import Run

# AIM_REPO='/home/azureuser/mycontainer/aim'
AIM_REPO='aim://REDACTED:53800'
AIM_EXPERIMENT='SSL-server'

run = Run(
    repo=AIM_REPO,
    experiment=AIM_EXPERIMENT
)

hparams_dict = {
    'learning_rate': 0.001,
    'batch_size': 32,
}
run['hparams'] = hparams_dict

# log metric
for i in range(30):
    if i % 5 == 0:
        i = i * 0.347
    run.track(float(i), name='numbers')

SGevorg commented 1 week ago

@JeroenVranken thanks for the issue. This could be related to the auth token things we have added recently. @mihran113 @alberttorosyan what do you guys think?