mcrumiller opened this issue 3 months ago
How do we think we should handle unsigned integers? Should they map to types that are equal in width and just raise on a possible overflow? Or should we inspect the data and maybe upscale to a larger width when it could hold the value losslessly?
> I would recommend upcasting to the smallest integer type that can hold all of the values:
If I'm understanding your suggestion, you think we should do that across the board? I.e., uint16 will always be upcast to a 32-bit integer even if none of its values need that much width?

I assume from that, in the uint64 case, we would still choose a BIGINT and just raise on overflow.
I'm also a little unsure about date64 support: date32 aligns perfectly with the postgres storage. I'm not sure what value there would be in converting date64 to date32 in the driver versus having the user or an upstream tool take care of that.
Mostly for convenience, I think for date64 we can just divide and treat as date32. (So it won't/can't roundtrip.)
Unsigned types are a little iffier to me
So date64 is always expected to be evenly divisible into a date, right? Likely just my lack of understanding of that type, but I am wary that it counts milliseconds.
Yup
It's weird because it was meant to be 1:1 with Java's old Date (AFAIK) which uses the same representation. IMO, it probably shouldn't have made it, but so it goes.
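The divide-and-treat-as-date32 idea above can be sketched in plain Python (illustrative only; the driver itself is C++, and the function and constant names here are mine):

```python
MS_PER_DAY = 86_400_000  # Arrow date64: milliseconds since the Unix epoch

def date64_to_date32(ms: int) -> int:
    """Floor-divide milliseconds to whole days (the proposed lossy conversion)."""
    return ms // MS_PER_DAY

ms = 19_724 * MS_PER_DAY + 5 * 3_600_000  # a whole date plus 5 stray hours
days = date64_to_date32(ms)
assert days == 19_724            # the sub-day part is dropped
assert days * MS_PER_DAY != ms   # so the value cannot round-trip
```

This makes the "won't/can't roundtrip" point concrete: any sub-day milliseconds are discarded by the floor division.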
> If I'm understanding your suggestion you just think we should do that across the board?
@WillAyd postgres doesn't really have a natural mapping to the standard u/ints used in most languages (why they, and seemingly all other DBMSs, didn't choose this design I'll never know). So the user cannot expect that their source data type will be respected, when there's not even a guarantee that a commensurate type exists at all.

On top of that, if a user calls `adbc_ingest`, there is no guarantee that they will not later insert data that won't fit. As an example, suppose a user has a `uint8` column with values that go up to 120. If adbc then casts to `int8` because it "fits" and then the user tries to insert a 200, they get an unpleasant surprise, which should not happen, because they are after all inserting data of the same dtype that they initially ingested.
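That surprise can be sketched in plain Python with the type ranges written out (the constants are mine, not driver API):

```python
INT8_MAX = 127    # signed 8-bit maximum
UINT8_MAX = 255   # unsigned 8-bit maximum

observed = [100, 110, 120]           # the uint8 data present at ingest time
assert max(observed) <= INT8_MAX     # int8 "fits", so a naive cast succeeds

later_value = 200                    # a perfectly valid uint8 value...
assert later_value <= UINT8_MAX
assert later_value > INT8_MAX        # ...that would overflow an int8 column
```

Data-dependent casting makes the table schema depend on which values happened to be present at ingest time, which is why the range-based upcast below is safer.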
I would recommend upcasting to the smallest integer type that can hold all of the values:
- uint8 -> int16
- uint16 -> int32
- uint32 -> int64
- uint64 -> ???
Maybe raising an error if the value is too large? As far as I know values > i64.max cannot be ingested into postgres.
Edit: there is a variable-sized decimal type that allows for up to 131,072 digits prior to the decimal point, but that feels too wonky.
The max i64 value is 9,223,372,036,854,775,807, which is pretty darn big. If people are using values larger than that, they are probably dealing with something not well suited for basic int types.
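The proposed mapping, plus raise-on-overflow for uint64, can be sketched as follows (illustrative Python; `UPCAST` and `check_uint64` are hypothetical names, not driver API):

```python
I64_MAX = 2**63 - 1  # 9,223,372,036,854,775,807

# Smallest signed type that holds the full unsigned range; uint64 has none,
# so it falls back to int64 with a runtime check.
UPCAST = {"uint8": "int16", "uint16": "int32", "uint32": "int64", "uint64": "int64"}

def check_uint64(value: int) -> int:
    """Reject uint64 values that do not fit in a postgres bigint."""
    if value > I64_MAX:
        raise OverflowError(f"{value} does not fit in a postgres bigint")
    return value

assert check_uint64(I64_MAX) == I64_MAX
try:
    check_uint64(2**64 - 1)
except OverflowError:
    pass  # values above i64 max are rejected rather than silently wrapped
```

The first three entries are lossless by construction; only uint64 needs the runtime check.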
> I would recommend upcasting to the smallest integer type that can hold all of the values:
>
> - uint8 -> int16
> - uint16 -> int32
> - uint32 -> int64
This is definitely the safest way to go, and had already been baked in to `PostgresType::FromSchema()`. In #2153 we updated the ingestion behaviour to actually use `PostgresType::FromSchema()`, and so for a "create" or "append to something created by adbc" this should now work:
```r
library(adbcdrivermanager)
library(nanoarrow)

con <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |>
  adbc_connection_init()

df <- tibble::tibble(
  uint8_col = 246:255,
  uint16_col = 65526:65535,
  uint32_col = (.Machine$integer.max + 1):(.Machine$integer.max + 10)
)

array <- df |>
  nanoarrow::as_nanoarrow_array(
    schema = na_struct(
      list(
        uint8_col = na_uint8(),
        uint16_col = na_uint16(),
        uint32_col = na_uint32()
      )
    )
  )

con |>
  execute_adbc("DROP TABLE IF EXISTS adbc_test")
array |>
  write_adbc(con, "adbc_test")

con |>
  read_adbc("select * from adbc_test") |>
  tibble::as_tibble()
#> # A tibble: 10 × 3
#>    uint8_col uint16_col uint32_col
#>        <int>      <int>      <dbl>
#>  1       246      65526 2147483648
#>  2       247      65527 2147483649
#>  3       248      65528 2147483650
#>  4       249      65529 2147483651
#>  5       250      65530 2147483652
#>  6       251      65531 2147483653
#>  7       252      65532 2147483654
#>  8       253      65533 2147483655
#>  9       254      65534 2147483656
#> 10       255      65535 2147483657
```
Unfortunately, the method we're using to efficiently insert (generating COPY data) requires that the types match exactly, so this will fail when appending Arrow data that happens to have an unsigned integer column to an existing table:
```r
con |>
  execute_adbc("DROP TABLE IF EXISTS adbc_test")
con |>
  execute_adbc("CREATE TABLE adbc_test (uint8_col int2, uint16_col int2, uint32_col int4)")
array |>
  write_adbc(con, "adbc_test", mode = "append")
#> Error in adbc_statement_execute_query(stmt): INVALID_ARGUMENT: [libpq] Failed to execute COPY statement: PGRES_FATAL_ERROR ERROR: incorrect binary data format
#> CONTEXT: COPY adbc_test, line 1, column uint16_col
```
> uint64 -> ???
For a fresh insert of Arrow data (i.e., when we are forced to generate a CREATE TABLE statement), this should probably be inferred as `bigint` because it preserves the "integerness", even though it will error at runtime if passed very large values. The workaround here would be to issue your own `CREATE TABLE (bigint_col BYTEA)`, and we will need to support generating COPY for various postgres types outside the 1:1 mapping that is currently used by the COPY writer.
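The CREATE TABLE inference described above can be sketched as a standalone mapping (hypothetical Python, my own names; the real driver does this in C++ via `PostgresType::FromSchema()`):

```python
# Hypothetical Arrow-type -> postgres-type mapping for a generated CREATE TABLE:
# unsigned types are widened one step, and uint64 falls back to bigint
# (which preserves "integerness" but may error at runtime on huge values).
ARROW_TO_POSTGRES = {
    "int16": "smallint", "int32": "integer", "int64": "bigint",
    "uint8": "smallint", "uint16": "integer",
    "uint32": "bigint",  "uint64": "bigint",
}

def infer_create_table(table: str, schema: list) -> str:
    """Render a CREATE TABLE statement from (column_name, arrow_type) pairs."""
    cols = ", ".join(f"{name} {ARROW_TO_POSTGRES[typ]}" for name, typ in schema)
    return f"CREATE TABLE {table} ({cols})"

sql = infer_create_table("adbc_test", [("uint8_col", "uint8"), ("uint64_col", "uint64")])
assert sql == "CREATE TABLE adbc_test (uint8_col smallint, uint64_col bigint)"
```

This only covers the "create" path; the append path additionally needs the COPY writer to accept non-1:1 type pairings, as noted above.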
What happened?

`adbc_ingest` fails for any dataframe that contains an unsigned int dtype and some (but not all) temporal dtypes.

Environment/Setup

adbc_driver_postgresql==1.0.0