apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

Unable to connect to Redshift using adbc_driver_postgresql #1563

Open j0bekt01 opened 9 months ago

j0bekt01 commented 9 months ago

What happened?

I was attempting to connect to Redshift using adbc_driver_postgresql, but I am getting this error: `IO: [libpq] Failed to build type mapping table: ERROR: column "typarray" does not exist in pg_type`. I am aware that Redshift is based on an earlier version of PostgreSQL. Is it currently possible to connect to Redshift? If not, are there plans to do so?

Environment/Setup

- Kernel: 4.14.336-253.554.amzn2.x86_64
- Python 3.9.18
- adbc-driver-manager 0.10.0
- adbc-driver-postgresql 0.10.0
- pyarrow 14.0.2
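For reference, a minimal reproduction looks roughly like this; the endpoint and credentials are placeholders, and any Redshift cluster should hit the same failure at connect time:

```python
# Minimal reproduction sketch; the URI below is a placeholder.
import adbc_driver_postgresql.dbapi

# Redshift speaks the PostgreSQL wire protocol, so the connection URI is an
# ordinary postgresql:// URI pointed at the Redshift endpoint (port 5439).
uri = "postgresql://user:password@my-cluster.example.redshift.amazonaws.com:5439/dev"

# The driver builds its Arrow type mapping table at connect time, which
# fails on Redshift with:
#   IO: [libpq] Failed to build type mapping table:
#   ERROR: column "typarray" does not exist in pg_type
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    pass
```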

lidavidm commented 9 months ago

Hmm, it looks like Redshift uses the PostgreSQL wire protocol but isn't quite compatible, so we'd have to handle it separately. Even then, maintaining CI would be a challenge, plus it appears to be based on PostgreSQL 8? We've had issues with older PostgreSQL versions in the past, too.

lidavidm commented 9 months ago

Specifically for the error here: we query the database metadata at startup to figure out which column types map to which Arrow types, but Redshift appears not to provide one of the required columns.
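For what it's worth, `typarray` was added to `pg_type` in PostgreSQL 8.3, and Redshift descends from PostgreSQL 8.0, which would explain the missing column. The exact startup query is internal to the driver, but here is a sketch of the failing shape (placeholder URI, illustrative query only):

```python
# Illustrative only: not the driver's actual startup query, just a catalog
# lookup of the same shape.
import adbc_driver_postgresql.dbapi

uri = "postgresql://user:password@host:5432/db"  # placeholder
with adbc_driver_postgresql.dbapi.connect(uri) as conn:
    cur = conn.cursor()
    # Succeeds on stock PostgreSQL (8.3+), where typarray exists; Redshift's
    # pg_type lacks the column, so the equivalent query there raises
    #   ERROR: column "typarray" does not exist
    cur.execute("SELECT oid, typname, typarray FROM pg_type")
    print(cur.fetchall()[:5])
```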

Sairam90 commented 1 month ago

Any updates on this issue?

paleolimbot commented 1 month ago

We can probably solve the specific issue of the missing `typarray` column. I don't have a way to test on Redshift, but I would be happy to look into it given a way to do so!

gtomitsuka commented 1 month ago

We're blocked on a transition to ADBC by this; TextQL is happy to provide access to one of our demo data warehouses with its default sample database for a few weeks. It's serverless, so it may require a cold start sometimes, but it works very well. I assume you won't need more than a few thousand queries for this.

Who should I talk to about this? Feel free to reach out, gabriel@textql.com.

I may be able to contribute myself in a month or so if the issue turns out to be much bigger than expected. This guide about the differences is pretty decent.

paleolimbot commented 1 month ago

I'll send you a note... I happen to be working on the Postgres driver right now and I'm happy to kick the tires a little to see if I can get it to work. To claim Redshift support long-term, we would need a set of more permanent credentials that we can use in CI, but there are a few companies doing this kind of testing anyway that might be willing to share with us if we can demonstrate a POC 🙂

gtomitsuka commented 1 month ago

My CTO is currently on honeymoon; I might be able to get long-term credentials for you too but would definitely need his approval first. Will keep you updated!

I've responded to your email with credentials; I might be able to get more for other people if needed.

j0bekt01 commented 1 month ago

Supporting Redshift would significantly boost the exposure and usage of this driver. In my experience working with databases, particularly Redshift, the primary bottleneck is extracting data quickly, and this support would greatly alleviate that issue. ConnectorX has gained popularity as the default engine for Polars, but it only supports read operations. For writing, one must use SQLAlchemy, which is less performant. Alternatively, you can use the Redshift COPY command, which involves dumping data to S3 via Arrow and then using the Redshift Data API to submit a COPY command (a sketch of that workaround follows below). The ability to both read and write with the ADBC driver would be a game-changer. I am happy to help contribute in any way I can.
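Here is a hedged sketch of that S3 + COPY workaround; the bucket, table, cluster, and IAM role below are all placeholders:

```python
# Sketch of the S3 + Data API COPY workaround; all names are placeholders.
import boto3
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})

# 1. Serialize the Arrow table to Parquet and stage it in S3.
buf = pa.BufferOutputStream()
pq.write_table(table, buf)
boto3.client("s3").put_object(
    Bucket="my-bucket",
    Key="staging/data.parquet",
    Body=buf.getvalue().to_pybytes(),
)

# 2. Ask Redshift to load the staged file via the Data API.
boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=(
        "COPY my_table "
        "FROM 's3://my-bucket/staging/data.parquet' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "
        "FORMAT AS PARQUET"
    ),
)
```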

WillAyd commented 1 month ago

> Alternatively, you can use the Redshift COPY command, which involves dumping data to S3 via Arrow and then using the Redshift Data API to submit a COPY command.

I believe you can also just use the AWS SDK for pandas (previously known as awswrangler), which does this automatically for you. That may be the best of your intermediate options, although yeah, having ADBC support in the near term is probably best.
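A short sketch of what that looks like; the Glue connection name, bucket, and table are placeholders:

```python
# awswrangler stages the frame to S3 as Parquet and issues the COPY for you.
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3]})

# Connect via a Glue Catalog connection name (a Secrets Manager secret also
# works); the name here is a placeholder.
con = wr.redshift.connect("my-glue-connection")
wr.redshift.copy(
    df=df,
    path="s3://my-bucket/staging/",
    con=con,
    schema="public",
    table="my_table",
)
con.close()
```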

j0bekt01 commented 1 month ago

> Alternatively, you can use the Redshift COPY command, which involves dumping data to S3 via Arrow and then using the Redshift Data API to submit a COPY command.

> I believe you can also just use the AWS SDK for pandas (previously known as awswrangler), which does this automatically for you. That may be the best of your intermediate options, although yeah, having ADBC support in the near term is probably best.

Yes, you are correct: awswrangler handles exactly this internally. However, I prefer using Polars over pandas, and I know conversion between the two is possible. I just believe that supporting a premier data warehouse like Redshift would solidify this driver as a strong alternative to JDBC/ODBC drivers.
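For anyone else in this spot, the Polars-to-pandas hop is cheap since both sit on Arrow memory, so the awswrangler route above still works; a tiny sketch:

```python
# Convert a Polars frame to pandas before handing it to awswrangler;
# the conversion goes through Arrow and is often near zero-copy.
import polars as pl

pl_df = pl.DataFrame({"id": [1, 2, 3]})
pdf = pl_df.to_pandas()
```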

WillAyd commented 1 month ago

Does the Apache project get any promotional credits from cloud providers? While having someone donate access is nice, it might be safest if we had a dedicated Redshift instance to connect to over OIDC (ditto for Google Cloud and the BigQuery driver).

This article is pretty old but I wonder if it still applies:

https://aws.amazon.com/blogs/opensource/aws-promotional-credits-open-source-projects/

lidavidm commented 1 month ago

I do not believe we get any credits, unfortunately.