apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0

Bulk Ingestion into Postgresql Numeric with scale #2177

Open ibnubay opened 1 day ago

ibnubay commented 1 day ago

What would you like help with?

Env:

Background: I want to load data from a Parquet file into PostgreSQL using bulk ingestion (the COPY command). I have a pyarrow "amount" column with type pa.float64() that needs to be inserted into a PostgreSQL numeric column with scale: "amount numeric(18, 2)".
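
For reference, the ingestion looks roughly like this (a minimal sketch: the connection URI and file path are placeholders, and the table and column names are taken from the error message below):

import pyarrow.parquet as pq
import adbc_driver_postgresql.dbapi

uri = "postgresql://..."  # placeholder connection URI
tbl = pq.read_table("invoices.parquet")  # "amount" is pa.float64() here

with adbc_driver_postgresql.dbapi.connect(uri) as conn, conn.cursor() as cur:
    # target column is "amount numeric(18, 2)"; adbc_ingest runs COPY under the hood
    cur.adbc_ingest("invoice_new", tbl, mode="append")
    conn.commit()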

Then I got this error:

Exception: ProgrammingError('INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR: insufficient data left in message\nCONTEXT: COPY invoice_new, line 1, column amount\n. SQLSTATE: 08P01')

I then tried changing the pyarrow schema of the "amount" column to decimal256(18, 2), and got this error instead:

Exception: ProgrammingError('INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR: invalid scale in external "numeric" value\nCONTEXT: COPY invoice_new, line 1, column amount\n. SQLSTATE: 22P03')

To fix this error, do I need to cast the pyarrow "amount" column to another type?

Ty

WillAyd commented 1 day ago

Does it matter if you use decimal128 versus decimal256? The former should work just fine with the provided precision / scale
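
If the column is still float64, the cast to decimal128 in pyarrow might look like the sketch below (tbl and the "amount" column name are taken from the setup above; float-to-decimal casts need a reasonably recent pyarrow):

import pyarrow as pa
import pyarrow.compute as pc

# cast the float64 "amount" column to decimal128(18, 2) before ingesting
idx = tbl.schema.get_field_index("amount")
tbl = tbl.set_column(idx, "amount", pc.cast(tbl["amount"], pa.decimal128(18, 2)))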

ibnubay commented 1 day ago

Sorry @WillAyd, the error happens when the "amount" value is something like "12222.00", i.e. when the decimal part is zero. I already tried both decimal128 and decimal256 and still get the same error as above. Any suggestions?

paleolimbot commented 21 hours ago

I think this is because the bulk insert feature requires that the schemas match exactly and because we don't take into account the existing column types when performing a bulk insertion (we should!). Specifically, that would mean that we should add a PostgresType argument here:

https://github.com/apache/arrow-adbc/blob/46dc748423dd4a03b227e7cd20a13898ac231dd2/c/driver/postgresql/copy/writer.h#L568-L571

...and return our field writers based on a many-to-many mapping (so that we can generate valid COPY output for a numeric column from a variety of Arrow input types, for example).

In the meantime, you should be able to use a parameterized INSERT as a workaround (i.e., INSERT INTO some_table VALUES (?)). (I forget exactly how to access the bulk bind via dbapi/ADBC in Python.) I believe there was another issue where somebody did a bulk insert into a temporary table and used SQL to do the type casting/insert.

WillAyd commented 21 hours ago

In the meantime, you should be able to use a parameterized INSERT as a workaround (i.e., INSERT INTO some_table VALUES (?)). (I forget exactly how to access the bulk bind via dbapi/ADBC in Python.)

I think you are looking for something like this:

import pyarrow as pa
import adbc_driver_postgresql.dbapi

tbl = pa.table(...)  # your data, e.g. with a decimal128(18, 2) "amount" column
with adbc_driver_postgresql.dbapi.connect(uri) as conn, conn.cursor() as cur:
    cur.adbc_prepare("INSERT INTO some_table VALUES (?)")
    cur.executemany("INSERT INTO some_table VALUES (?)", tbl)
    conn.commit()

lidavidm commented 17 hours ago

This won't be as fast as the bulk insert, just to be clear (but I think it should be a bit better than other drivers since we can still at least use the binary format? It's only that we have to do a server roundtrip for each row in this case)

ibnubay commented 8 hours ago

I tried another approach to handle inserting decimal data into PostgreSQL. Similar to another issue about retrieving decimal data from PostgreSQL, I load the decimal data into a "staging" table as a string column, and then, when upserting into the "main" table, I convert it from VARCHAR to DECIMAL using SQL.
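
For reference, that staging approach might look roughly like the sketch below (the staging table name is assumed, and the final SELECT is abbreviated; fill in your other columns):

import pyarrow as pa
import pyarrow.compute as pc
import adbc_driver_postgresql.dbapi

# keep "amount" as a string column for the staging table
idx = tbl.schema.get_field_index("amount")
staging = tbl.set_column(idx, "amount", pc.cast(tbl["amount"], pa.string()))

with adbc_driver_postgresql.dbapi.connect(uri) as conn, conn.cursor() as cur:
    cur.adbc_ingest("invoice_staging", staging, mode="create")
    # cast from VARCHAR to NUMERIC while moving rows into the main table
    cur.execute(
        "INSERT INTO invoice_new "
        "SELECT ..., CAST(amount AS numeric(18, 2)) FROM invoice_staging"
    )
    conn.commit()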

I think that's enough for me, even though it adds an extra decimal conversion step.

Thanks for the help, guys.