apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

ADBC Python Postgres - Stuck connections to the database #1881

Open gaspardc-met opened 3 months ago

gaspardc-met commented 3 months ago

What happened?

Context before the bug (working):

Switching to ADBC:

Problem:

How can we reproduce the bug?

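import pandas as pd
from adbc_driver_manager.dbapi import Connection
from adbc_driver_postgresql import dbapi
logger_stdout = logging.getLogger(__name__)  # stand-in for the reporter's stdout logger

def create_adbc_conn() -> Connection:
    logger_stdout.info(f"Creating a new ADBC connection at {pd.Timestamp.now()}.")
    uri = get_default_uri()  # URI shown above, formatted (reporter's helper, definition not shown)
    connection = dbapi.connect(uri=uri)
    logger_stdout.info("ADBC connection created")
    return connection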

Environment/Setup

Python 3.11, pandas==2.2.2, adbc_driver_postgresql==0.11.0, adbc-driver-manager==0.11.0

zeroshade commented 3 months ago

@kou @lidavidm any ideas on this one? I'm not familiar enough with the Postgres driver (or Postgres in general) to guess where the problem is.

lidavidm commented 3 months ago

Is it possible to share the log?

I think we'll have to set up a pod for 12 hours and see if we can reproduce at all...
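For a long-running reproduction like this, a small watchdog probe could help tell a hung call apart from one that fails outright. This is a sketch under assumptions, not part of ADBC: `probe_once` and `run_probe` are hypothetical names, and `run_query` stands for any zero-argument callable that issues a query (for example, one that calls the reporter's `create_adbc_conn()` and runs `SELECT 1`). Each call runs in a daemon thread so the probe can stop waiting after `timeout_s`; it cannot cancel the blocked call itself.

```python
import threading
import time
from collections import Counter
from typing import Callable


def probe_once(run_query: Callable[[], object], timeout_s: float) -> str:
    """Run one query in a daemon thread and report 'ok', 'error', or 'stuck'."""
    outcome: dict[str, object] = {}

    def target() -> None:
        try:
            run_query()
            outcome["status"] = "ok"
        except Exception as exc:  # record the failure instead of crashing the probe
            outcome["status"] = "error"
            outcome["exc"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # Still blocked: we can only stop waiting; the daemon thread is abandoned.
        return "stuck"
    return str(outcome["status"])


def run_probe(run_query: Callable[[], object], *, interval_s: float,
              timeout_s: float, duration_s: float) -> Counter:
    """Call probe_once repeatedly until duration_s has elapsed and tally outcomes."""
    tally: Counter = Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        status = probe_once(run_query, timeout_s)
        tally[status] += 1
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {status}", flush=True)
        time.sleep(interval_s)
    return tally
```

Something like `run_probe(my_query, interval_s=60, timeout_s=30, duration_s=12 * 3600)` (with `my_query` hypothetical) would cover a 12-hour window. Whether `my_query` reuses one long-lived connection or opens a fresh one each time is worth deciding deliberately, since the report is about connections going bad over time.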

kou commented 3 months ago

Do you set timeout-related parameters such as idle_session_timeout? See https://www.postgresql.org/docs/current/runtime-config-client.html for other timeout-related parameters.
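One way to answer this from the client side is to read the server's current values from the standard `pg_settings` view. A minimal sketch, assuming the same URI the reporter already uses; `TIMEOUT_QUERY`, `nonzero_settings`, and `fetch_timeout_settings` are hypothetical names, and the driver is imported lazily so the pure helper works without it.

```python
from typing import Iterable, Optional

# pg_settings is a standard PostgreSQL view. The LIKE patterns catch every
# *_timeout setting (idle_session_timeout, statement_timeout, ...) plus TCP keepalives.
TIMEOUT_QUERY = """
SELECT name, setting, unit
FROM pg_settings
WHERE name LIKE '%timeout%' OR name LIKE 'tcp_keepalives%'
ORDER BY name
"""


def nonzero_settings(rows: Iterable[tuple[str, str, Optional[str]]]) -> dict[str, str]:
    """Drop zero-valued rows. For the *_timeout settings 0 means 'disabled';
    for tcp_keepalives_* it means 'use the operating-system default'."""
    return {name: f"{setting}{unit or ''}" for name, setting, unit in rows if setting != "0"}


def fetch_timeout_settings(uri: str) -> list:
    # Imported here so nonzero_settings is usable without the driver installed.
    from adbc_driver_postgresql import dbapi

    with dbapi.connect(uri=uri) as conn:
        with conn.cursor() as cur:
            cur.execute(TIMEOUT_QUERY)
            return cur.fetchall()
```

`nonzero_settings(fetch_timeout_settings(uri))` would then list only the timeouts that are actually in effect.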

Could you show the output of SELECT * FROM pg_stat_activity while this problem is happening? See also: https://www.postgresql.org/docs/current/monitoring-stats.html#MONITORING-PG-STAT-ACTIVITY-VIEW
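`SELECT *` works, but a narrower query can make stuck sessions easier to spot. The sketch below is an assumption, not something from the thread: the column choice and the "long time in one state" heuristic are ours, `client_addr` (an `inet`) is cast to text and the state age is computed server-side as float seconds so each column maps to a plain Python value. Without superuser or `pg_read_all_stats`, other users' rows show limited detail.

```python
from typing import Any, Sequence

ACTIVITY_QUERY = """
SELECT pid,
       application_name,
       client_addr::text AS client_addr,
       state,
       wait_event_type,
       wait_event,
       EXTRACT(EPOCH FROM clock_timestamp() - state_change)::float8 AS seconds_in_state,
       left(query, 200) AS query
FROM pg_stat_activity
WHERE datname = current_database()
ORDER BY state_change
"""


def rows_as_dicts(description: Sequence[Sequence[Any]],
                  rows: Sequence[Sequence[Any]]) -> list[dict[str, Any]]:
    """Pair each row with the column names from cursor.description (name is item 0)."""
    names = [col[0] for col in description]
    return [dict(zip(names, row)) for row in rows]


def long_in_state(sessions: list[dict[str, Any]], min_seconds: float) -> list[dict[str, Any]]:
    """Sessions whose state has not changed for at least min_seconds, longest first."""
    old = [s for s in sessions
           if s["seconds_in_state"] is not None and s["seconds_in_state"] >= min_seconds]
    return sorted(old, key=lambda s: s["seconds_in_state"], reverse=True)
```

With a DB-API cursor, `rows_as_dicts(cur.description, cur.fetchall())` after `cur.execute(ACTIVITY_QUERY)` produces the input for `long_in_state`.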

Do you pass connection to handle_sql_query()? Or is connection=None used?
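Our reading of why this question matters (not stated in the thread): if the `connection=None` path opens a new connection on every call and never closes it, server backends pile up. The reporter's `handle_sql_query()` is not shown, so this is only a hypothetical pattern for that kind of optional-connection parameter: use the caller's connection untouched, or open one and always close it.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator, Optional


@contextmanager
def borrowed_or_owned(connection: Optional[Any], connect: Callable[[], Any]) -> Iterator[Any]:
    """Yield the caller's connection as-is, or open a new one and close it afterwards.

    `connect` is any zero-argument factory (hypothetically, create_adbc_conn).
    """
    if connection is not None:
        # The caller owns this connection and is responsible for closing it.
        yield connection
        return
    owned = connect()
    try:
        yield owned
    finally:
        # Runs even if the query raised, so no server backend is left behind.
        owned.close()
```

Inside a function like `handle_sql_query(query, connection=None)`, the body would then run under `with borrowed_or_owned(connection, create_adbc_conn) as conn:`.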

lidavidm commented 3 months ago

One other thing, if we want to try and reproduce this, what was the PostgreSQL version?

gaspardc-met commented 3 months ago

Hello all, thanks for the quick response.

@lidavidm: log entries:

2024-05-20 14:05:17.660 GMT [521061] LOG:  could not receive data from client: Connection reset by peer
2024-05-20 14:13:31.910 GMT [521817] FATAL:  unsupported frontend protocol 0.0: server supports 3.0 to 3.0
2024-05-20 14:13:32.214 GMT [521818] FATAL:  unsupported frontend protocol 255.255: server supports 3.0 to 3.0
2024-05-20 14:13:32.518 GMT [521819] FATAL:  no PostgreSQL user name specified in startup packet
2024-05-20 14:15:07.995 GMT [521500] LOG:  could not receive data from client: Connection reset by peer
2024-05-20 15:26:08.819 GMT [528434] LOG:  invalid length of startup packet

The "could not receive data from client: Connection reset by peer" message appeared a lot and is normal for us. I think it comes from connections the web app did not close properly (I now close every connection), and it happens with SQLAlchemy too. The other messages are new: they appeared only while the connection was stuck, not before.

PG version:

@kou :

gaspardc-met commented 3 months ago

@kou :

lidavidm commented 3 months ago

Thanks. Interesting: errors like this are occasionally reported elsewhere [1], but the usual explanation is that something is port-scanning Postgres, which doesn't seem likely here. On the other hand, if the client were doing something wrong after running for a long time, restarting the client completely should presumably reset that. So it seems instead that something borks the server.
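The FATAL lines in the log are the server rejecting the first packet of a connection, the StartupMessage from the PostgreSQL frontend/backend protocol: an Int32 length (counting itself), an Int32 protocol version, NUL-terminated name/value pairs, and a final NUL byte. The server prints the version as "major.minor", taking the high and low 16 bits. This hypothetical sketch (not part of any library) builds such a packet and decodes the version field the same way, which shows that "0.0" means the version bytes were all zero.

```python
import struct

PROTOCOL_3_0 = (3 << 16) | 0  # 196608: major in the high 16 bits, minor in the low 16


def build_startup_message(params: dict[str, str], major: int = 3, minor: int = 0) -> bytes:
    """Int32 length (including itself), Int32 version, NUL-terminated
    name/value pairs, then one final NUL byte."""
    body = struct.pack("!I", (major << 16) | minor)
    for name, value in params.items():
        body += name.encode() + b"\x00" + value.encode() + b"\x00"
    body += b"\x00"
    return struct.pack("!I", len(body) + 4) + body


def protocol_version_str(packet: bytes) -> str:
    """Format the version field as 'major.minor', as the server's log message does."""
    if len(packet) < 8:
        raise ValueError("need at least the 4-byte length and the 4-byte version")
    _length, version = struct.unpack("!II", packet[:8])
    return f"{version >> 16}.{version & 0xFFFF}"
```

A libpq client (the ADBC PostgreSQL driver is built on libpq) normally sends version 3.0 and a `user` parameter, whereas a packet like `build_startup_message({})` lacks the user name the third FATAL line complains about. That is one reason these lines look more like foreign traffic than like the driver, though this is our reading, not a confirmed diagnosis.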

Just to clarify, though:

(1) Did you try restarting the Postgres server, too? (2) Did you restart the Postgres server and then try with SQLAlchemy? (3) If not, then it sounds like: using ADBC for a while borks the server, but using SQLAlchemy (without restarting the server) works, and it's unknown whether switching back to ADBC would fail or work?

gaspardc-met commented 3 months ago
  1. I never touched the Postgres server pod before, during, or after the issue.
  2. I did not try restarting the server with ADBC.
  3. Switching to SQLAlchemy seems to have fixed the issue both times: between the first and second ADBC attempts, and after the second.
  4. I could probably resume using ADBC right now, without changing the previous code, and wait for the issue to recur.
lidavidm commented 3 months ago

Thanks. I'll try to find time to set up a container and reproduce a setup like this.

lidavidm commented 3 months ago

Sorry, it looks like I won't have time to investigate this anytime soon. It's on my backlog and I do hope to get to it, but any help here is welcome.

WillAyd commented 2 months ago

Does it matter at all if you remove pandas and use the connection directly to parse the results? It's possible that something in pandas is also causing the problem.
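One way to try this, assuming the original path went through pandas (for example `pandas.read_sql` on the ADBC connection, which the thread does not show), is to read the result straight from the ADBC DB-API cursor as an Arrow table. `fetch_arrow` is a hypothetical helper name; `cursor()`, `execute()`, and `fetch_arrow_table()` are methods of the ADBC DB-API cursor.

```python
from typing import Any


def fetch_arrow(connection: Any, query: str) -> Any:
    """Run query on an ADBC DB-API connection and return a pyarrow.Table.

    pandas is not involved: the result comes straight from the driver's Arrow stream.
    """
    with connection.cursor() as cur:
        cur.execute(query)
        # Reads the whole result before the cursor is closed on leaving the block.
        return cur.fetch_arrow_table()
```

If connections still get stuck with `fetch_arrow(create_adbc_conn(), "SELECT 1")` in place of the pandas call, pandas is probably not the culprit; `table.to_pandas()` can still be used afterwards if a DataFrame is needed.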