apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
360 stars 87 forks source link

feat(c/driver/postgresql): Add support for bulk ingestion of list types #1882

Closed judahrand closed 2 months ago

judahrand commented 3 months ago

What feature or improvement would you like to see?

These types would include List, LargeList, FixedSizeList and could also include FixedSizedTensor.

cpcloud commented 3 months ago

+1 here, this blocks us using the adbc postgres driver in ibis (which would be great, it removes a lot of very slow ingest hacks that we have)

paleolimbot commented 3 months ago

Just spotted this and thought I'd leave some thoughts on how to do this (for me or the next person that gets here!):

We have support for the other direction (Postgres Array to List), and I don't think it is too hard to reverse engineer that:

https://github.com/apache/arrow-adbc/blob/573c1130f3f2e8ecb6800d904545d34ff53c0cf8/c/driver/postgresql/copy/reader.h#L446-L532

...some test data that could be used to write a test:

https://github.com/apache/arrow-adbc/blob/573c1130f3f2e8ecb6800d904545d34ff53c0cf8/c/driver/postgresql/copy/postgres_copy_test_common.h#L129-L140

...the test for the read direction:

https://github.com/apache/arrow-adbc/blob/573c1130f3f2e8ecb6800d904545d34ff53c0cf8/c/driver/postgresql/copy/reader.h#L446-L532

lidavidm commented 3 months ago

I think the one thing that tripped me up when I took a look is that the array binary format includes the OID of the array elements. On read, we just ignore it, but we need to figure out the right OID to use on write (and I think none of the other writers need to care about OID).

It's possible Postgres ignores it too, though! Didn't test.

paleolimbot commented 3 months ago

I just did a quick browse and I'm wondering if this method will help:

https://github.com/apache/arrow-adbc/blob/00f15269ae3d04d5c018c04202663b9caba0baf7/c/driver/postgresql/postgres_type.h#L367-L377

lidavidm commented 3 months ago

I think we need the reverse? Given the array OID find the child OID. (And in this case the array OID may not yet exist, if we're creating the table...maybe this is more complicated than it seems.)

paleolimbot commented 3 months ago

With an up-to-date PostgresTypeResolver I think that's:

https://github.com/apache/arrow-adbc/blob/00f15269ae3d04d5c018c04202663b9caba0baf7/c/driver/postgresql/postgres_type.h#L355

...plus:

https://github.com/apache/arrow-adbc/blob/00f15269ae3d04d5c018c04202663b9caba0baf7/c/driver/postgresql/postgres_type.h#L185

...but perhaps getting an up-to-date PostgresTypeResolver is difficult here.

lidavidm commented 3 months ago

Aha. Thanks!

We can always plumb through the resolver but it seems like we should have enough info.