apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

Schema and ptype inference #347

Open krlmlr opened 11 months ago

krlmlr commented 11 months ago

For DBI.

library(nanoarrow)

df <- data.frame(a = 1:3, b = 4.5, c = "five")
schema <- nanoarrow::infer_nanoarrow_schema(df)
nanoarrow::infer_nanoarrow_schema(schema)
#> Error in infer_nanoarrow_schema.default(schema): Can't infer Arrow type for object of class nanoarrow_schema
nanoarrow::infer_nanoarrow_ptype(df)
#> Error in nanoarrow::infer_nanoarrow_ptype(df): `x` must be a nanoarrow_schema(), nanoarrow_array(), or nanoarrow_array_stream()
tibble::as_tibble(nanoarrow::infer_nanoarrow_ptype(schema))
#> # A tibble: 0 × 3
#> # ℹ 3 variables: a <int>, b <dbl>, c <chr>

Created on 2023-12-25 with reprex v2.0.2

paleolimbot commented 11 months ago

Should we allow infer_nanoarrow_schema() on schema objects?

I was initially rather careful to separate as_nanoarrow_schema() ("I'm looking for a data type; get me that data type as a nanoarrow schema") from infer_nanoarrow_schema() ("what would the schema be after I call as_nanoarrow_array() on this thing?"). In R that distinction is somewhat confusing because we don't have data type objects, just zero-size vectors. At this point the lack of infer_nanoarrow_schema.nanoarrow_schema() seems to be just adding confusion, and I don't think it would hurt to add it.
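Until such a method exists, a caller can probably get the same effect with a small wrapper. This is only a sketch: schema_or_infer() is a hypothetical name, not part of nanoarrow, and it assumes that passing an existing schema through unchanged is the behaviour being asked for.

```r
library(nanoarrow)

# Hypothetical helper (not part of nanoarrow): return a schema unchanged,
# and infer a schema for any other object.
schema_or_infer <- function(x) {
  if (inherits(x, "nanoarrow_schema")) {
    x
  } else {
    infer_nanoarrow_schema(x)
  }
}

df <- data.frame(a = 1:3, b = 4.5, c = "five")
schema <- schema_or_infer(df)           # inferred from the data frame
schema_again <- schema_or_infer(schema) # passed through as-is
```

The wrapper keeps the strict behaviour of infer_nanoarrow_schema() intact and moves the permissive case into the calling code.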

Should we allow infer_nanoarrow_ptype() on R objects, effectively emulating data roundtrip?

I think that would maybe be confusing: vec_ptype() and infer_nanoarrow_ptype() could return different things. If it were added to nanoarrow, I would prefer to call it ptype_after_roundtrip() or something similarly descriptive. Or maybe DBI just wants to know whether something will roundtrip?
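For illustration, such a helper could be composed from the two existing functions, as the reprex above already does by hand. This is a rough sketch, not a nanoarrow API: ptype_after_roundtrip() is only the name floated in this comment, and the helper looks at types only, without converting any data.

```r
library(nanoarrow)

# Hypothetical helper (not part of nanoarrow): the ptype an R object would
# have after being converted to Arrow and back, based on its inferred schema.
ptype_after_roundtrip <- function(x) {
  infer_nanoarrow_ptype(infer_nanoarrow_schema(x))
}

df <- data.frame(a = 1:3, b = 4.5, c = "five")
ptype <- ptype_after_roundtrip(df)
```

Comparing the result with vctrs::vec_ptype(df) would be one way to check whether a given object is expected to roundtrip unchanged.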

krlmlr commented 11 months ago

Hm... I don't mind leaving this as is for a bit; it would just save a few lines of code. Going from strict to permissive is easy, but is it useful elsewhere? Let's see.