apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

[R] Convert arrow dictionary to R factor via `as.data.frame.nanoarrow_array_stream()`? #513

Open eitsupi opened 3 weeks ago

eitsupi commented 3 weeks ago

Maybe related to #220

I noticed that when a nanoarrow_array_stream is converted to a data.frame, dictionary columns become character vectors instead of factors.

stream <- data.frame(
  x = as.factor(letters[1:5]),
  y = as.factor(1:5)
) |>
  nanoarrow::as_nanoarrow_array_stream()

stream
#> <nanoarrow_array_stream struct<x: dictionary(int32)<string>, y: dictionary(int32)<string>>>
#>  $ get_schema:function ()
#>  $ get_next  :function (schema = x$get_schema(), validate = TRUE)
#>  $ release   :function ()

stream |>
  tibble::as_tibble()
#> # A tibble: 5 × 2
#>   x     y
#>   <chr> <chr>
#> 1 a     1
#> 2 b     2
#> 3 c     3
#> 4 d     4
#> 5 e     5

Created on 2024-06-09 with reprex v2.1.0

paleolimbot commented 3 weeks ago

Thanks for bringing this up!

One of the tricky things about dictionaries in Arrow is that the "levels"/"dictionary" live at the array level, not at the type level. This means that two arrays can both have the type dictionary(int32, string) while each carries its own dictionary. Arrow C++ (and therefore arrow R) handles this with a rather complex system of "dictionary unification", which it can do because it has equality kernels and can do fancy things. nanoarrow doesn't have any of that, so I made the default conversion a little simpler. It also handles dictionaries of things that aren't just strings in a more predictable way (or at least a more stable one, even if the result is unexpected to the average R user).
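
To make the array-level point concrete, here is a small base-R sketch (not nanoarrow code) that uses factors as a stand-in for dictionary-encoded arrays. The variable names are illustrative only, and note that R's factor codes are 1-based while Arrow dictionary indices are 0-based.

```r
# Two "batches" with the same type (factor) but different dictionaries (levels)
batch1 <- factor(c("a", "b", "a"))
batch2 <- factor(c("f", "g", "f"))

# The underlying integer codes are identical...
codes1 <- as.integer(batch1)
codes2 <- as.integer(batch2)
identical(codes1, codes2)

# ...but they decode to different values, because each batch has its own levels
levels(batch1)[codes1]
levels(batch2)[codes2]
```

Concatenating the raw codes of the two batches would therefore be wrong without first reconciling the levels, which is what dictionary unification does.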

You should be able to specify that you want a factor() specifically, and this will work for converting just one batch. If you need to convert an arbitrary stream, at the moment you'll need to know the levels in advance (this could be fixed so that the conversion "learns" the levels as it goes and finalizes the array at the end, which is basically an implementation of dictionary unification written in R).

library(nanoarrow)
#> Warning: package 'nanoarrow' was built under R version 4.3.3

df1 <- data.frame(
  x = as.factor(letters[1:5]),
  y = as.factor(1:5)
)

df2 <- data.frame(
  x = as.factor(letters[6:10]),
  y = as.factor(1:5)
)

# The default returns the dictionary value type: this is the safest and most
# type-stable option, and it makes the fewest assumptions
basic_array_stream(list(df1, df2)) |> 
  convert_array_stream() |> 
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y    
#>    <chr> <chr>
#>  1 a     1    
#>  2 b     2    
#>  3 c     3    
#>  4 d     4    
#>  5 e     5    
#>  6 f     1    
#>  7 g     2    
#>  8 h     3    
#>  9 i     4    
#> 10 j     5

# You can specify a factor() target type if you know the levels
basic_array_stream(list(df1, df2)) |> 
  convert_array_stream(
    data.frame(x = factor(levels = letters), y = factor(levels = as.character(1:5)))
  ) |> 
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y    
#>    <fct> <fct>
#>  1 a     1    
#>  2 b     2    
#>  3 c     3    
#>  4 d     4    
#>  5 e     5    
#>  6 f     1    
#>  7 g     2    
#>  8 h     3    
#>  9 i     4    
#> 10 j     5

# If you have only one batch, factor() should work as a target (but doesn't currently)
basic_array_stream(list(df1)) |> 
  convert_array_stream(
    data.frame(x = factor(), y = factor())
  ) |> 
  tibble::as_tibble()
#> # A tibble: 5 × 2
#>   x     y    
#>   <fct> <fct>
#> 1 a     1    
#> 2 b     2    
#> 3 c     3    
#> 4 d     4    
#> 5 e     5

Created on 2024-06-09 with reprex v2.1.0
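
The idea of "learning" the levels as batches arrive and finalizing at the end could look roughly like the following. This is a minimal base-R sketch operating on plain factors, not on nanoarrow arrays; `unify_factors()` is a hypothetical helper and not part of nanoarrow.

```r
# Hypothetical helper: unify per-batch factors into a single factor
unify_factors <- function(batches) {
  # "Learn" the levels in order of first appearance across all batches
  all_levels <- unique(unlist(lapply(batches, levels)))
  # Decode each batch to its values, concatenate, and re-encode once
  # against the unified levels
  values <- unlist(lapply(batches, as.character))
  factor(values, levels = all_levels)
}

# Two batches whose dictionaries only partially overlap
batches <- list(
  factor(letters[1:5]),
  factor(letters[4:8])
)

unified <- unify_factors(batches)
levels(unified)
```

A real implementation would presumably remap integer indices per batch instead of round-tripping through character vectors, but the result should be equivalent for string dictionaries.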

eitsupi commented 3 weeks ago

> One of the tricky things about dictionaries in Arrow is that the "levels"/"dictionary" live at the array level, not at the type level.

Thanks for the detailed explanation. I see, this is indeed a complicated process.

Perhaps the statistics on the C interface that are currently being discussed could provide some sort of dictionary for the entire column...?

paleolimbot commented 3 weeks ago

> Perhaps the statistics on the C interface that are currently being discussed could provide some sort of dictionary for the entire column...?

I think that convert_array_stream(stream, factor()) could be smarter: when I first implemented the "convert to R" logic I didn't allow for any flexibility with respect to "finalizing" a value. I had to bite that bullet to support GeoArrow (i.e., with the nanoarrow_vctr()), but at the moment attempting to convert a stream with a target factor() will error.

There is also a PR open to refactor the conversion process to make it easier to add these features: https://github.com/apache/arrow-nanoarrow/pull/392
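
In the meantime, one possible workaround when the levels are not known in advance is to rely on the default conversion and build the factors afterwards. This is only a sketch: it reuses `basic_array_stream()` and `convert_array_stream()` as shown earlier in this thread, assumes that every character column should become a factor, and does not preserve the original dictionary order.

```r
library(nanoarrow)

df1 <- data.frame(x = as.factor(letters[1:5]))
df2 <- data.frame(x = as.factor(letters[6:10]))

# The default conversion yields the dictionary value type (character here)
out <- basic_array_stream(list(df1, df2)) |>
  convert_array_stream()

# Build factors afterwards. factor() sorts the levels, so the order of the
# original dictionaries is not preserved.
out[] <- lapply(out, function(col) if (is.character(col)) factor(col) else col)
str(out)
```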