Ping first if conn is idle for too long

edgedb / edgedb-python

The official Python client library for EdgeDB

https://edgedb.com

Apache License 2.0

369 stars 45 forks source link

Ping first if conn is idle for too long #365

Closed fantix closed 2 years ago

fantix commented 2 years ago

The blocking sockets don't have active connection_lost() events, if the user is trying to execute a non-read-only query over a disconnected connection, is_closed() cannot capture the disconnection and the retry loop wouldn't work. In this case, we are sending a SYNC before the query, so that we could capture the connection error and reconnect.

This is tailored to handle the server-side session_idle_timeout; for regular network interruptions within the session_idle_timeout, we don't want to sacrifice performance by inserting more frequent SYNCs just to avoid connection errors on non-read-only queries.

fantix commented 2 years ago

This is tailored to handle the server-side session_idle_timeout; for regular network interruptions within the session_idle_timeout, we don't want to sacrifice performance by inserting more frequent SYNCs just to avoid connection errors on non-read-only queries.

Or maybe we do? We could also lazily select() for READ (and reconnect if necessary) every - say - (more than) 1 second per connection.

UPDATE: we don't, it's a rather marginal case and we cannot guarantee this doesn't happen at all.

fantix commented 2 years ago

Actually Elvis suggested over the meeting that, a proper fix should be that the server also sends a Terminate message before shutting down (idle) connections, so that the client won't have a vague state about the disconnection. I'll also do that, but this PR I think is still needed for EdgeDB 2.2 and lower.

tailhook commented 2 years ago

On unixes you can do just poll with POLLOUT. If OS thinks connection is down, it will return an error. I think you can achieve the same with select on windows.
As Terminate is a subject to race condition, you have to handle it in every receive loop, which is quite inconvenient. And TCP RST packet sent by OS (which is caught by (1)) is almost as reliable as application specific Terminate packet.
Sync is probably the only reliable way (it covers more than (1) and (2) combined), although costly. But that cost is probably not critical as it is only incurred when connection was idle for the long time.

Also we could probably skip Sync if we send Parse instead of cached query, but that is probably over-optimizing too niche case.

I would go with (1) or (3), the latter is in current PR. As I don't see a significant difference between (1) and (2), whereas (2) complicates implementations of all blocking and async clients.

1st1 commented 2 years ago

Sync is probably the only reliable way

Yeah.