ADBC Python Postgres - Stuck connections to the database

gaspardc-met commented 3 months ago

What happened?

Context before the bug (working):

Postgres database on Kubernetes with several tables
4 services (webapp, machine learning inference, and FastAPI backend APIs) deployed on kubernetes and fetching data from postgres
1 service, data orchestrator, writing data to Postgres
Fetching data from PG with pd.read_sql from Pandas and a SQLalchemy engine
Been doing this for 1+ year without any Postgres issues

Switching to ADBC:

Following my upgrade to pandas >2.0.0 I wanted to switch to adbc_driver_postgresql's dbapi connection with pd.read_sql
Initial tests were great, it was faster than before
Deployed this to production on all aforementioned services twice (initially with connection caching, then no caching and properly closing each and every connection
Once again initially smooth, everything worked and was fast

Problem:

In both instances of the deployment, within ~12 hours, the connections would be stuck
Webapp or another service would create an ADBC connection, and run the sql query with pd.read_sql (know this through caching) and then wait indefinitely.
Reloading the webapp, clearing webapp cache, recreating the connection would do nothing at all
The log on the Postgres pod indicated a password issue with the current database/user, which never happened before
Both SQLalchemy and ADBC get the same postgres URI to create the engine/connection with
Reverting to SQLalchemy solved the problem, and the error has not been seen again

How can we reproduce the bug?

The given URI was "postgresql://{user}:{password}@{host}:{port}/{db}" formatted with the proper values
The function was used to create the ADBC connection:

def create_adbc_conn() -> Connection:
    logger_stdout.info(f"Creating a new ADBC connection at {pd.Timestamp.now()}.")
    uri = get_default_uri() # URI shown above, formatted
    connection = dbapi.connect(uri=uri)
    logger_stdout.info("ADBC connection created")
    return connection

The function to execute the SQL query was:

def handle_sql_query(
sql: str,
index_col: Optional[str] = None,
connection: Optional[Connection] = None,
need_to_close: bool = False,
) -> pd.DataFrame:
if connection is None:
    logger_stdout.info(f"Connection is None, creating a new ADBC connection at {pd.Timestamp.now()}.")
    connection= create_adbc_conn()
    need_to_close = True
try:
    logger_stdout.info("Executing SQL query with connection")
    return pd.read_sql_query(sql=sql, con=connection, index_col=index_col, parse_dates=[index_col])
finally:
    if need_to_close:
        logger_stdout.info("Closing the ADBC connection.")
        connection.close()

The SQL queries ranged from select * from TABLE_NAME to selecting specific columns on a range of specific dates

Environment/Setup

python 3.11 pandas == 2.2.2 adbc_driver_postgresql==0.11.0 adbc-driver-manager==0.11.0

zeroshade commented 3 months ago

@kou @lidavidm any ideas on this one? I'm not familiar enough with the postgres driver (or postgres in general) to have an idea where the problem is

lidavidm commented 3 months ago

Is it possible to share the log?

I think we'll have to set up a pod for 12 hours and see if we can reproduce at all...

kou commented 3 months ago

Do you set timeout related parameters such as idle_session_timeout? See https://www.postgresql.org/docs/current/runtime-config-client.html for other timeout related parameters?

Could you show SELECT * FROM pg_stat_activity when this problem is happen? See also: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW

Do you pass connection to handle_sql_query()? Or is connection=None used?

lidavidm commented 3 months ago

One other thing, if we want to try and reproduce this, what was the PostgreSQL version?

gaspardc-met commented 3 months ago

Hello all, thanks for the quick response.

@lidavidm : Log entries

2024-05-20 14:05:17.660 GMT [521061] LOG:  could not receive data from client: Connection reset by peer
2024-05-20 14:13:31.910 GMT [521817] FATAL:  unsupported frontend protocol 0.0: server supports 3.0 to 3.0
2024-05-20 14:13:32.214 GMT [521818] FATAL:  unsupported frontend protocol 255.255: server supports 3.0 to 3.0
2024-05-20 14:13:32.518 GMT [521819] FATAL:  no PostgreSQL user name specified in startup packet
2024-05-20 14:15:07.995 GMT [521500] LOG:  could not receive data from client: Connection reset by peer
2024-05-20 15:26:08.819 GMT [528434] LOG:  invalid length of startup packet

The could not receive data from client: Connection reset by peer happened a lot and is standard, I think it's a connection not properly closed by the webapp (now I close every connection) and happen with SQLalchemy too. The others are "new" and happened when the connection was stuck and not before

PG version:

psql --version: psql (PostgreSQL) 14.5
Docker image: docker.io/bitnami/postgresql:14.5.0-debian-11-r35

@kou :

I did not create the postgres pod, and I couldn't see any environment variables on this, so it might be just "vanilla" bitnami/postgresql
The problem happened both with passing a cached connection and creating a new connection. I tried both.

gaspardc-met commented 3 months ago

@kou :

I will try to recreate the issue next week locally so I can share the SELECT * FROM pg_stat_activity then
However I did run the command right now and noticed these irregular entries from my first attempt at deploying with ADBC (17th of mai then 20th of mai)

lidavidm commented 3 months ago

Thanks. Interesting, there's occasional reference to errors like this elsewhere 1, but usually the suggestion is that something is port-scanning Postgres. That doesn't seem likely here. On the other hand, if the client were doing something wrong after a long time, restarting the client completely should presumably reset that. So instead it seems like something borks the server.

Just to clarify, though:

(1) Did you try restarting the Postgres server, too? (2) Did you restart the Postgres server and then try with SQLAlchemy? (3) If not, then it sounds like: using ADBC for a while borks the server, but using SQLAlchemy (without restarting the server) works, and it's unknown whether switching back to ADBC would fail or work?

gaspardc-met commented 3 months ago

Never touched the postgres server pod before, during or after the issue.
Did not try restarting the server with ADBC
Using SQLalchemy seems to have fixed the issue between and after 1st and 2nd attempt with ADBC
I could probably just resume using ADBC right now without changing the previous code until the next issue

lidavidm commented 3 months ago

Thanks. I'll try to find time to set up a container and reproduce a setup like this.

lidavidm commented 3 months ago

Sorry, it's looking like I won't have time to investigate this anytime soon. This is on my backlog and I do hope to get to it but any help here is welcome

WillAyd commented 2 months ago

Does it matter at all if you remove pandas and use the connection directly to parse the results? Its possible there is also something with pandas that is causing the problem

apache / arrow-adbc