dolthub / dolt

Dolt – Git for Data
Apache License 2.0
17.72k stars 503 forks source link

Docker Mysql Server Resets When Connecting Immediately After Launch #8131

Closed farhanhubble closed 1 month ago

farhanhubble commented 1 month ago

Scenario:

MRE:

# No bytes received when not waiting before connection (time.sleep(0))
docker start DoltServer && python -c 'import socket; import time; time.sleep(0); sock=socket.create_connection(("localhost", 3307), timeout=1); sock.settimeout(1); d=sock.recv(1024); print(d)' ; docker stop DoltServer

# Valid bytes when waiting (sleep(2))
> docker start DoltServer && python -c 'import socket; import time; time.sleep(2); sock=socket.create_connection(("localhost", 3307), timeout=1); sock.settimeout(1); d=sock.recv(1024); print(d)' ; docker stop DoltServer
> b'J\x00\x00\x00\n8.0.33\x00\x01\x00\x00\x00\x0bzke)r4@\x00\x8f\xa2!\x00\x00;\x01\x15\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x7fDh>`ozb@Ri5\x00mysql_native_password\x00'

Additional Info:

behavior: read_only: false autocommit: true persistence_behavior: load disable_client_multi_statements: false dolt_transaction_commit: false event_scheduler: "ON"

user: name: "" password: ""

listener: host: 0.0.0.0 port: 3306 max_connections: 100 read_timeout_millis: 28800000 write_timeout_millis: 28800000 tls_key: null tls_cert: null require_secure_transport: null allow_cleartext_passwords: null

performance: query_parallelism: null

data_dir: .

cfg_dir: .doltcfg

metrics: labels: {} host: null port: -1

remotesapi: {}

privilege_file: .doltcfg/privileges.db

branch_control_file: .doltcfg/branch_control.db

user_session_vars: []

timsehn commented 1 month ago

Thanks. This seems like we're accepting the connection in a state where we can't service it and then never getting back to it. Good repro. We'll get started debugging this in the next day or so.

fulghum commented 1 month ago

Thanks for providing such easy repro steps @farhanhubble! 🙏 I've been digging into this one this morning, and I can repro the behavior you're seeing.

I'm pretty sure this is a timing issue... when docker start starts up the stopped container, it isn't giving the container long enough to fully get back into service, so the python script connects to the mapped port, but there isn't anything inside the docker container ready to handle that request, hence the empty response. One of the clues supporting that is that when I take out the docker stop DoltServer command at the end of your repro, I don't see this behavior anymore.

I'm going to experiment with adding a HEALTHCHECK to our sql-server Dockerfile. I believe that can get docker start to pause until the server is ready to handle connections.

Thanks for helping us identify this issue. I'll have another update for you soon.

fulghum commented 1 month ago

I dug into this one some more... I'm not a Docker expert, so this was fun for me to get deeper into what's going on with various container commands and container startup...

docker start will always return as soon as soon as the container is started. This means that the initialization script for the Dolt sql-server image (or for the MySQL server image) will still be running when docker start returns. Defining health checks in a Dockerfile does not change this – they expose a health status for the container, but they do not cause docker start to wait for the container to be healthy before they return.

docker run has a slightly different behavior. It does start a container, just like docker start, but by default, it also attaches to that container to show the STDOUT from the main process. This ends up giving the initialization scripts a little more time to run before the container is used by another process, which may be enough for container initialization to complete. (You can also pass --attach to docker start, but it doesn't help in your case, because your python script won't run until you detach from that docker container or stop the docker container).

I noticed that the MySQL Docker image (which we use an example for Dolt's sql-server image) does not define healthchecks and I also saw in the documentation that it has the same behavior with initialization taking a few seconds. (See the "No connections until MySQL init complete" section in the MySQL Docker image documentation). That's specifically around initializing a brand new MySQL install, but it illustrates that the same behavior is happening with docker start always returning immediately.

The standard approach for this with Docker containers seems to be to poll the container to check for readiness. We provide an example of doing this with Kubernetes in our docs in case that's helpful. For your example, I think a good approach would be to have a small loop in your Python code that checks to see if d has received enough bytes to indicate that the server is running. That loop could retry every 250ms or so, for a set number of times, and then break out of the loop when it detects a successful connection, so the rest of the application code can run. I've included an example python script that does this below in case that's helpful.

Let us know if any of that doesn't make sense, or if you just think I'm misunderstanding something that's going on here. We're open to adding more of this info to our docs if you think that might have been helpful for you, too.

Python check_connection.py Example

import socket
import time

# Parameters
host = "localhost"
port = 3307
timeout = 1
check_interval = 0.25  # 250 milliseconds
max_duration = 5  # 5 seconds

# Function to check for data
def check_connection():
    start_time = time.time()
    while True:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.settimeout(timeout)

        try:
            data = sock.recv(1024)
            if len(data) > 4:
                print("Data received:", data)
                return
        except socket.timeout:
            pass
        finally:
            sock.close()

        if time.time() - start_time > max_duration:
            print("Timeout: No sufficient data received within 5 seconds.")
            return

        time.sleep(check_interval)

# Run the data check function
check_connection()

This script can be used like this: docker start DoltServer && python check_connection.py && docker stop DoltServer

farhanhubble commented 1 month ago

Thank you so much @fulghum! I'm no docker expert myself. The interesting thing is that I noticed this issue because I had written a await_heartbeat() function to check that the server was ready and all it did was something like nc -z localhost 3037 and check for the return code. Because of the bug though I'd get a return code of 0 and then when I tried to read data I'd get the PyMySQL error. I modified the function to check liveness by actually trying to read data, which helped me get around the problem.

I wasn't aware of the subtleties of docker start vs run but in this case using docker run did not make a difference. I did not want to count on a process being started to decide if the process was ready for network connections and was relying on await_heartbeat(). In hindsight, I should have checked with a read operation earlier since there are a few ways for ports to show up open even when the process that opened them is gone.

fulghum commented 1 month ago

My pleasure! This was a fun one to dig into. 😄 I'm glad you found a way to poll for the server readiness and verify that the server has started up successfully. For others who may read this issue... polling for a successful read is a much more resilient approach than a static sleep since startup times can change for a variety of reasons such as load on the host or the size of the database.

I'll go ahead and resolve since we got this figured out, but don't hesitate to let us know if there's anything else we can do to help you build with Dolt!