Galileo-Galilei / kedro-mlflow

A kedro-plugin for integration of mlflow capabilities inside kedro projects (especially machine learning model versioning and packaging)
https://kedro-mlflow.readthedocs.io/
Apache License 2.0

Mlflow UI getting Timeout and not being reachable again #564

Closed stefano-brambilla-venchi closed 2 weeks ago

stefano-brambilla-venchi commented 1 month ago

Description

When I run kedro mlflow ui, the UI initially works correctly, but after some time (usually about an hour) I get a timeout and I can no longer reach the UI.

Context

I am just using the service in a very standard kedro pipeline.

Steps to Reproduce

I simply run kedro mlflow ui and use it as intended. I can reach the UI and use it normally. After a random amount of time, usually a couple of hours, the UI stops responding and I have to restart the service.

I noticed that when the UI stops working I usually find a [CRITICAL] WORKER TIMEOUT (pid:...) followed by [ERROR] Error handling request (no URI read) in the mlflow log. However, lsof -i :5000 still shows the service listening on the port under other pids.
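In case it is useful, this is roughly how I check from the VM itself whether the UI is still answering, which tells a dead server apart from a dropped tunnel (a sketch; the URL matches my local port and is not part of the plugin):

```python
import urllib.request
import urllib.error


def is_ui_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server at `url` answers the request at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # Any response below 500 means the server is alive and handling requests.
            return resp.status < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, reset, or timed out: the server is unreachable.
        return False


if __name__ == "__main__":
    print(is_ui_up("http://127.0.0.1:5000/"))
```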

Expected Result

The service should never go down.

Actual Result

This is an extract of my log:

[2024-07-10 12:19:27 +0000] [5351] [INFO] Starting gunicorn 22.0.0
[2024-07-10 12:19:27 +0000] [5351] [INFO] Listening at: http://127.0.0.1:5002 (5351)
[2024-07-10 12:19:27 +0000] [5351] [INFO] Using worker: sync
[2024-07-10 12:19:27 +0000] [5352] [INFO] Booting worker with pid: 5352
[2024-07-10 12:19:27 +0000] [5353] [INFO] Booting worker with pid: 5353
[2024-07-10 12:19:27 +0000] [5354] [INFO] Booting worker with pid: 5354
[2024-07-10 12:19:27 +0000] [5355] [INFO] Booting worker with pid: 5355
[2024-07-10 12:33:07 +0000] [5351] [INFO] Handling signal: winch
[2024-07-10 12:33:07 +0000] [5351] [INFO] Handling signal: winch
[2024-07-10 12:33:07 +0000] [5351] [INFO] Handling signal: winch
[2024-07-10 13:29:49 +0000] [5351] [CRITICAL] WORKER TIMEOUT (pid:5353)
[2024-07-10 13:29:49 +0000] [5353] [ERROR] Error handling request (no URI read)
Traceback (most recent call last):
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/workers/sync.py", line 134, in handle
    req = next(parser)
          ^^^^^^^^^^^^
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/parser.py", line 42, in __next__
    self.mesg = self.mesg_class(self.cfg, self.unreader, self.source_addr, self.req_count)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/message.py", line 257, in __init__
    super().__init__(cfg, unreader, peer_addr)
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/message.py", line 60, in __init__
    unused = self.parse(self.unreader)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/message.py", line 269, in parse
    self.get_data(unreader, buf, stop=True)
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/message.py", line 260, in get_data
    data = unreader.read()
           ^^^^^^^^^^^^^^^
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 37, in read
    d = self.chunk()
        ^^^^^^^^^^^^
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/http/unreader.py", line 64, in chunk
    return self.sock.recv(self.mxchunk)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/brambilla/src/venv-11/lib/python3.11/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1
[2024-07-10 13:29:49 +0000] [5353] [INFO] Worker exiting (pid: 5353)
[2024-07-10 13:29:50 +0000] [17110] [INFO] Booting worker with pid: 17110
[2024-07-10 14:04:03 +0000] [5351] [INFO] Handling signal: winch
[2024-07-10 14:04:03 +0000] [5351] [INFO] Handling signal: winch

Your Environment

Python 3.11.9
Kedro 0.19.5
MLflow 2.13.0
kedro-mlflow 0.12.2

My MLflow URI is localhost (127.0.0.1); I work on an Azure remote VM with SSH tunnel port forwarding. The VM runs Ubuntu 22.04. The laptop from which I reach the port runs Windows 10. The SSH tunnel is set up through Visual Studio Code.
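Since the traffic goes through an SSH tunnel, an idle connection being torn down could also explain the symptom. For reference, an equivalent manual tunnel with keepalives enabled would look like this (the username and hostname are placeholders for my VM, not real values):

```shell
# Forward local port 5000 to 127.0.0.1:5000 on the VM, sending a
# keepalive probe every 60 s so an idle tunnel is not silently dropped;
# give up after 3 unanswered probes.
ssh -L 5000:127.0.0.1:5000 \
    -o ServerAliveInterval=60 \
    -o ServerAliveCountMax=3 \
    azureuser@my-azure-vm
```

I have not yet checked whether the VS Code tunnel sets equivalent keepalive options.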

Does the bug also happen with the latest version on master?

I have not tried.

Galileo-Galilei commented 1 month ago

Hi, sorry to hear that. Unfortunately, this will be really hard to debug: it may be some network error given your setup, but it's hard to tell. Can you try

mlflow ui --backend-store-uri file:///path/to/mlruns --host 127.0.0.1 --port 5000

and see if the error still happens? If yes, this is a problem with mlflow / gunicorn / your network that I can't really fix. If no, it is something I should investigate on the kedro-mlflow side.
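If the direct mlflow run shows the same worker timeouts, it may also be worth raising gunicorn's worker timeout. MLflow exposes gunicorn options through the `--gunicorn-opts` flag of the server/ui command; the 120 s value below is just an example, not a recommendation:

```shell
# Raise the gunicorn sync-worker timeout (default 30 s) to 120 s,
# so slow requests are less likely to trigger WORKER TIMEOUT kills.
mlflow ui --backend-store-uri file:///path/to/mlruns \
    --host 127.0.0.1 --port 5000 \
    --gunicorn-opts "--timeout 120"
```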

Galileo-Galilei commented 1 month ago

Hi @stefano-brambilla-venchi, did you get a chance to try the suggestion above? Are you still experiencing the error?

Galileo-Galilei commented 2 weeks ago

I am closing the issue since it is very likely not related to kedro-mlflow, but feel free to reopen if you have more details.