mcrumiller opened this issue 3 months ago
How do we think we should handle unsigned integers? Should they map to types that are equal in width and just raise on a possible overflow? Or should we inspect the data and maybe upscale to a larger width when it could hold the value losslessly?
> I would recommend upcasting to the smallest integer type that can hold all of the values:
If I'm understanding your suggestion, you think we should do that across the board? I.e., uint16 will always be upcast to a 32-bit integer even if none of its values need that much width?

I assume from that, in the uint64 case, we would still choose a BIGINT and just raise on overflow.
I'm also a little unsure about date64 support: date32 aligns perfectly with the postgres storage. I'm not sure what value there would be in converting date64 to date32 in the driver versus having the user or an upstream tool take care of that.
Mostly for convenience, I think for date64 we can just divide and treat as date32. (So it won't/can't roundtrip.)
Unsigned types are a little iffier to me
So date64 is always expected to be evenly divisible into a date, right? Likely just my lack of understanding of that type, but I am wary that it counts milliseconds.
Yup
It's weird because it was meant to be 1:1 with Java's old Date (AFAIK) which uses the same representation. IMO, it probably shouldn't have made it, but so it goes.
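The divide-and-treat-as-date32 idea above can be sketched in plain Python (illustrative only; the driver itself is C++, and the function and constant names here are mine):

```python
MS_PER_DAY = 86_400_000  # Arrow date64: milliseconds since the Unix epoch

def date64_to_date32(ms: int) -> int:
    """Floor-divide milliseconds to whole days (the proposed lossy conversion)."""
    return ms // MS_PER_DAY

ms = 19_724 * MS_PER_DAY + 5 * 3_600_000  # a whole date plus 5 stray hours
days = date64_to_date32(ms)
assert days == 19_724            # the sub-day part is dropped
assert days * MS_PER_DAY != ms   # so the value cannot round-trip
```

This makes the "won't/can't roundtrip" point concrete: any sub-day milliseconds are discarded by the floor division.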
> If I'm understanding your suggestion you just think we should do that across the board?
@WillAyd postgres doesn't really have a natural mapping to the standard u/ints used in most languages (why they, and seemingly all other DBMSs, didn't choose this design I'll never know). So the user cannot expect that their source data type will be respected, when there's not even a guarantee that a commensurate type exists at all.

On top of that, if a user calls `adbc_ingest`, there is no guarantee that they will not later insert data that won't fit. As an example, suppose a user has a `uint8` column with values that go up to 120. If adbc then casts to `int8` because it "fits" and then the user tries to insert a 200, they get an unpleasant surprise, which should not happen, because they are after all inserting data of the same dtype that they initially ingested.
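That surprise can be sketched in plain Python with the type ranges written out (the constants are mine, not driver API):

```python
INT8_MAX = 127    # signed 8-bit maximum
UINT8_MAX = 255   # unsigned 8-bit maximum

observed = [100, 110, 120]           # the uint8 data present at ingest time
assert max(observed) <= INT8_MAX     # int8 "fits", so a naive cast succeeds

later_value = 200                    # a perfectly valid uint8 value...
assert later_value <= UINT8_MAX
assert later_value > INT8_MAX        # ...that would overflow an int8 column
```

Data-dependent casting makes the table schema depend on which values happened to be present at ingest time, which is why the range-based upcast below is safer.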
I would recommend upcasting to the smallest integer type that can hold all of the values:
- uint8 -> int16
- uint16 -> int32
- uint32 -> int64
- uint64 -> ???
Maybe raising an error if the value is too large? As far as I know values > i64.max cannot be ingested into postgres.
Edit: there is a variable-sized decimal type that allows for up to 131,072 digits prior to the decimal point, but that feels too wonky.
The max i64 value is 9,223,372,036,854,775,807, which is pretty darn big. If people are using values larger than that, they are probably dealing with something not well suited for basic int types.
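The proposed mapping, plus raise-on-overflow for uint64, can be sketched as follows (illustrative Python; `UPCAST` and `check_uint64` are hypothetical names, not driver API):

```python
I64_MAX = 2**63 - 1  # 9,223,372,036,854,775,807

# Smallest signed type that holds the full unsigned range; uint64 has none,
# so it falls back to int64 with a runtime check.
UPCAST = {"uint8": "int16", "uint16": "int32", "uint32": "int64", "uint64": "int64"}

def check_uint64(value: int) -> int:
    """Reject uint64 values that do not fit in a postgres bigint."""
    if value > I64_MAX:
        raise OverflowError(f"{value} does not fit in a postgres bigint")
    return value

assert check_uint64(I64_MAX) == I64_MAX
try:
    check_uint64(2**64 - 1)
except OverflowError:
    pass  # values above i64 max are rejected rather than silently wrapped
```

The first three entries are lossless by construction; only uint64 needs the runtime check.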
> I would recommend upcasting to the smallest integer type that can hold all of the values:
>
> - uint8 -> int16
> - uint16 -> int32
> - uint32 -> int64
This is definitely the safest way to go, and had already been baked in to `PostgresType::FromSchema()`. In #2153 we updated the ingestion behaviour to actually use `PostgresType::FromSchema()`, and so for a "create" or "append to something created by adbc" this should now work:
```r
library(adbcdrivermanager)
library(nanoarrow)

con <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |>
  adbc_connection_init()

df <- tibble::tibble(
  uint8_col = 246:255,
  uint16_col = 65526:65535,
  uint32_col = (.Machine$integer.max + 1):(.Machine$integer.max + 10)
)

array <- df |>
  nanoarrow::as_nanoarrow_array(
    schema = na_struct(
      list(
        uint8_col = na_uint8(),
        uint16_col = na_uint16(),
        uint32_col = na_uint32()
      )
    )
  )

con |>
  execute_adbc("DROP TABLE IF EXISTS adbc_test")
array |>
  write_adbc(con, "adbc_test")

con |>
  read_adbc("select * from adbc_test") |>
  tibble::as_tibble()
#> # A tibble: 10 × 3
#>    uint8_col uint16_col uint32_col
#>        <int>      <int>      <dbl>
#>  1       246      65526 2147483648
#>  2       247      65527 2147483649
#>  3       248      65528 2147483650
#>  4       249      65529 2147483651
#>  5       250      65530 2147483652
#>  6       251      65531 2147483653
#>  7       252      65532 2147483654
#>  8       253      65533 2147483655
#>  9       254      65534 2147483656
#> 10       255      65535 2147483657
```
Unfortunately, the method we're using to efficiently insert (generating COPY data) requires that the types match exactly, so this will fail when appending Arrow data that happens to have an unsigned integer column to an existing table:
```r
con |>
  execute_adbc("DROP TABLE IF EXISTS adbc_test")
con |>
  execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int2, uint32_col int4)")
array |>
  write_adbc(con, "adbc_test", mode = "append")
#> Error in adbc_statement_execute_query(stmt): INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR: incorrect binary data format
#> CONTEXT: COPY adbc_test, line 1, column uint16_col
```
> uint64 -> ???
For a fresh insert of Arrow data (i.e., when we are forced to generate a CREATE TABLE statement), this should probably be inferred as `bigint` because it preserves the "integerness", even though it will error at runtime if passed very large values. The workaround here would be to issue your own `CREATE TABLE (bigint_col BYTEA)`, and we will need to support generating COPY for various postgres types outside the 1:1 mapping that is currently used by the COPY writer.
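The CREATE TABLE inference described above can be sketched as a standalone mapping (hypothetical Python, my own names; the real driver does this in C++ via `PostgresType::FromSchema()`):

```python
# Hypothetical Arrow-type -> postgres-type mapping for a generated CREATE TABLE:
# unsigned types are widened one step, and uint64 falls back to bigint
# (which preserves "integerness" but may error at runtime on huge values).
ARROW_TO_POSTGRES = {
    "int16": "smallint", "int32": "integer", "int64": "bigint",
    "uint8": "smallint", "uint16": "integer",
    "uint32": "bigint",  "uint64": "bigint",
}

def infer_create_table(table: str, schema: list) -> str:
    """Render a CREATE TABLE statement from (column_name, arrow_type) pairs."""
    cols = ", ".join(f"{name} {ARROW_TO_POSTGRES[typ]}" for name, typ in schema)
    return f"CREATE TABLE {table} ({cols})"

sql = infer_create_table("adbc_test", [("uint8_col", "uint8"), ("uint64_col", "uint64")])
assert sql == "CREATE TABLE adbc_test (uint8_col smallint, uint64_col bigint)"
```

This only covers the "create" path; the append path additionally needs the COPY writer to accept non-1:1 type pairings, as noted above.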
What happened?

`adbc_ingest` fails for any dataframe that contains an unsigned int dtype and some (but not all) temporal dtypes.

Environment/Setup

adbc_driver_postgresql==1.0.0