acteng / update-your-capital-schemes

Update your capital schemes service.
https://update-your-capital-schemes.activetravelengland.gov.uk/

Database connection occasionally broken in production #131

Open · markhobson opened this issue 1 month ago

markhobson commented 1 month ago

Following on from #127, we now see broken database connection errors logged in production, although due to pessimistic disconnect handling they are recovered from and the error is not propagated to the user:

Exception terminating connection <pg8000.legacy.Connection object at 0x3e27575c84d0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/pg8000/core.py", line 144, in _flush
    sock.flush()
  File "/usr/local/lib/python3.12/socket.py", line 726, in write
    return self._sock.send(b)
           ^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/pool/base.py", line 374, in _close_connection
    self._dialect.do_terminate(connection)
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 704, in do_terminate
    self.do_close(dbapi_connection)
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/engine/default.py", line 707, in do_close
    dbapi_connection.close()
  File "/usr/local/lib/python3.12/site-packages/pg8000/core.py", line 549, in close
    _flush(self._sock)
  File "/usr/local/lib/python3.12/site-packages/pg8000/core.py", line 146, in _flush
    raise InterfaceError("network error") from e
pg8000.exceptions.InterfaceError: network error
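
The "pessimistic disconnect handling" mentioned above is SQLAlchemy's pool_pre_ping option; the sketch below shows that kind of engine setup, with a placeholder connection URL rather than the service's real configuration:

import sqlalchemy

# Minimal sketch of pessimistic disconnect handling: pool_pre_ping issues a
# lightweight "ping" each time a connection is checked out of the pool, so a
# connection the server has silently closed is discarded and replaced instead
# of surfacing a BrokenPipeError inside a request.
engine = sqlalchemy.create_engine(
    "postgresql+pg8000://user:password@localhost/schemes",  # placeholder URL
    pool_pre_ping=True,
)

with engine.connect() as connection:
    connection.execute(sqlalchemy.text("SELECT 1"))
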
markhobson commented 1 month ago

Response from Google:

Thank you for the reply.

I have indeed reproduced the issue on my side. Kindly allow me to explain the process I followed.

To match your use case (public IP, Unix socket, a minimum of one instance, and CPU allocated only during requests), I deployed a Cloud Run service whose only function is to retrieve all rows in a table and return them as JSON. Initially, requests went through to the Cloud SQL database and the service worked as expected. The Cloud Run service's connection was then left idle, as it had finished executing the request.

A couple of hours later, I first inspected the Cloud SQL PostgreSQL activity to check the connections and found that the Cloud Run service was no longer present, meaning the connection had been closed/terminated after being idle for some time. I then invoked the service again and observed "network error" logs similar to the ones you have seen, and, as per your description, the connection was immediately refreshed on the second invocation.

Additionally, I changed the service's CPU allocation to "CPU is always allocated" to check whether the same behaviour occurred. With this setting, the service initially worked as expected. The connection again sat idle on the Cloud SQL instance, and I let it close/terminate automatically to match the first test.

Once I had confirmed that the connection was closed (i.e. not present when querying "SELECT * FROM pg_stat_activity WHERE datname = ''"), I invoked the service again; the connection was re-established to the Cloud SQL instance, the request succeeded, and no network error logs were observed.

Based on this test, my hypothesis is that the issue stems from the service's CPU allocation: the Cloud SQL instance initially refuses the connection from the service while no CPU is allocated to the minimum instance, and once CPU is allocated the connection is established. The reasoning is that when the network error occurs, the service's connection to the Cloud SQL instance is not registered when querying the database engine's activity.

For this reason, I would suggest either of the following:

  • Use the "CPU is always allocated" option for the service.
  • Keep the connection from the Cloud Run service to the Cloud SQL instance alive for a long period of time, even while it is idle.

In any case, as you have requested that the case be kept open, I will acknowledge this request and will reach out to you by the 1st of August, 19:00 CEST, should I not have heard from you. Kindly let me know if the above information helps.

markhobson commented 1 month ago

Possible steps to diagnose what causes a broken pipe - for each step, does Cloud SQL close the connection over time?

  1. Create public Cloud SQL database and connect locally using SQLAlchemy (see the sketch after this list)
  2. Make database private and add Cloud SQL Auth Proxy using TCP locally
  3. Change Cloud SQL Auth Proxy to use Unix socket
  4. Create Cloud Run service with one minimum instance to connect to database using Cloud SQL Auth Proxy Unix socket
  5. Change Cloud Run to always attach CPU
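
For step 1, a minimal connectivity check against the public instance might look like the following; hostname, credentials, and database name are placeholders, and the idle period is arbitrary:

import time

import sqlalchemy

# Hypothetical check for step 1: connect to the public instance, run a trivial
# query, idle for a few hours, then query again on the same connection to see
# whether Cloud SQL has closed it in the meantime.
engine = sqlalchemy.create_engine(
    "postgresql+pg8000://user:password@203.0.113.10/schemes"  # placeholder public IP
)

with engine.connect() as connection:
    print(connection.execute(sqlalchemy.text("SELECT now()")).scalar_one())
    time.sleep(4 * 60 * 60)  # leave the connection idle
    print(connection.execute(sqlalchemy.text("SELECT now()")).scalar_one())
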
Sparrow0hawk commented 2 weeks ago

My message:

I have been investigating this issue further and have some additional questions/observations.

The application that is experiencing these errors is a Cloud Run service that connects to the Cloud SQL instance using the Connect with Unix socket method. My understanding is that, having configured my Cloud Run service to mount a volume called /cloudsql and connect it to my Cloud SQL connection name, the Cloud Run service creates a Unix socket at the mount point using the Cloud SQL Auth Proxy.
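
For reference, that Unix-socket connection is typically configured along these lines with SQLAlchemy and pg8000; the instance connection name, credentials, and database here are placeholders, not the service's real values:

import sqlalchemy

# Sketch of connecting over the /cloudsql Unix socket that Cloud Run mounts;
# pg8000 takes the socket path via the unix_sock parameter.
url = sqlalchemy.engine.URL.create(
    drivername="postgresql+pg8000",
    username="user",          # placeholder credentials
    password="password",
    database="schemes",       # placeholder database name
    query={"unix_sock": "/cloudsql/my-project:my-region:my-instance/.s.PGSQL.5432"},
)
engine = sqlalchemy.create_engine(url)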

Since this is a Unix socket, it isn't possible to configure any sort of keepalive, so adjusting TCP keepalive settings won't improve the network errors I am encountering. Even if keepalives were worth setting, I'm unclear how they would solve this issue given that, outside of requests, the Cloud Run service has its CPU deallocated and so would be unable to send a keepalive.

Maintaining this Unix socket connection requires that the Cloud SQL Auth Proxy is also running within (or alongside) the Cloud Run service. For our service we have minimum instances set to 1 to prevent cold starts. This means there is always one instance running, but unless it is actively processing a request the instance is idle, where memory may be persisted but CPU is deallocated. If the instance requires the Auth Proxy to be running to maintain the Unix socket, what happens to the Auth Proxy process during instance idling? Is the process simply suspended and resumed when CPU is reallocated, or does the Auth Proxy have to reauthenticate and restart when CPU is reallocated? Or, if the Auth Proxy needs to regenerate its SSL certificate while it is suspended, would that introduce a fractional delay on resuming the instance, during which the Auth Proxy isn't available via the Unix socket, that could cause this error?

Reply:

Hello,

Thank you for the reply. Based on the description, I understand that the desired configuration for the instance is still "CPU is only allocated during request processing", and that your latest inquiry is about how the Cloud SQL Auth Proxy behaves behind the scenes when using a Unix socket for your Cloud Run service.

Regarding the Cloud SQL Auth Proxy activity within Cloud Run with respect to the CPU allocation configuration, I would like to mention that, from the application's perspective, it automatically uses the Cloud SQL Auth Proxy, as the connection mode is a Unix socket using the "/cloudsql" mount point.

I have reviewed the following Google Cloud documentation [1], which details CPU allocation for Cloud Run services (excluding Cloud Run jobs), and I also reviewed the internal details of my previous reproduction, in which I found the likely behaviour during request processing at the time the network errors were observed. To explain in detail: the CPU was not allocated at the time of the test, and it was allocated when a request came in, at which point the application is processed, including establishing a connection to the Cloud SQL instance (i.e. authenticating, connecting, processing SQL statements, etc.). I noticed that the authentication step was delayed during this process, which may indicate that the overall process fails on the initial request and eventually succeeds once the CPU has been allocated.

Because of this, I believe the Cloud SQL Auth Proxy is suspended and is restarted, alongside the application, when CPU is allocated to the instance. This is consistent with Cloud Run autoscaling still being in effect over the instance's lifecycle.

Furthermore, I would also like to mention that a retry strategy in the application may be a feasible way to at least mitigate the network errors observed.

Kindly let me know if the above information clears up your doubts.

Kind Regards,
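
For illustration, the retry strategy suggested in the reply could take a shape like the following; this is a rough sketch rather than what the service implements, and the table name is a placeholder:

import time

import sqlalchemy
from sqlalchemy.exc import DBAPIError

# Rough sketch of such a retry: if the pooled connection turns out to be dead
# (e.g. pg8000's "network error"), SQLAlchemy invalidates it, and retrying then
# checks out a fresh connection from the pool.
def fetch_schemes(engine, attempts=3, delay=0.5):
    for attempt in range(attempts):
        try:
            with engine.connect() as connection:
                return connection.execute(sqlalchemy.text("SELECT * FROM scheme")).all()
        except DBAPIError as error:
            # connection_invalidated is set when SQLAlchemy has already
            # discarded the broken connection; only those failures are retried.
            if attempt == attempts - 1 or not error.connection_invalidated:
                raise
            time.sleep(delay)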