apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
360 stars 86 forks source link

adbc_driver_postgres flatten multi dimensional array in Postgres #2063

Open andrenmo opened 1 month ago

andrenmo commented 1 month ago

What happened?

When querying a multi dimensional array in Postgres the returned result "loose" the dimensions, e. g. a 2 dimension array is returned as 1 dimension array, the query:

select array[[1, 1], [2, 3]]

It should return:

[([[1, 1],[2, 3]],)]

It returns:

[([1, 1, 2, 3],)]

Stack Trace

No response

How can we reproduce the bug?

import adbc_driver_postgresql.dbapi

conn_string = 'postgresql://postgres:XXXX@localhost:5432/postgres'
conn = adbc_driver_postgresql.dbapi.connect(conn_string)

cursor = conn.cursor()
query = 'select array[[1, 1], [2, 3]]'
cursor.execute(query)
cursor.fetchall()

Environment/Setup

macOS PostgreSQL 15 adbc-driver-postgresql 1.1.0

paleolimbot commented 1 month ago

Thank you for opening!

One of the challenges with the Postgres array type is that it can (in theory) have a different number of dimensions for each item, whereas in Arrow land we need to know the number of dimensions in advance. For everything except a single dimension (where Arrow's list type can handle items of different sizes), we would need to know the complete shape in advance (and perhaps return the new tensor extension type).

In the near term we could error when we encounter a number of dimensions that is not exactly one? I don't think we have a mechanism to "warn" or "notice" that we have potentially lost some information.

andrenmo commented 1 month ago

Thanks for the response, in a ideal world this query:

select array[[1, 1], [2, 3]]
union all
select array[[1, 1]]

would return

[([[1, 1], [2, 3]],), ([[1, 1], [None, None]],)]

but an warning (or error that can be silenced with a flag) at least can help to debug why the data is different.

paleolimbot commented 1 month ago

but an warning (or error that can be silenced with a flag) at least can help to debug why the data is different.

We could probably wire up an error that could be silenced with a statement option today. There is an open issue about warnings/notices which might require a spec change (https://github.com/apache/arrow-adbc/issues/1243).

This is another place where the ability to request a schema (https://github.com/apache/arrow-adbc/issues/1514) would be helpful as a workaround. It is difficult for the driver to predict this kind of thing without inventing an extension type or implementing SQLite-ish type learning of some kind, but a user might know that they are going to need (e.g.) a 2D array out of a specific column in their result.