TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License

Let's introduce unnest_wider() in TidierData.jl! #34

Closed atantos closed 9 months ago

atantos commented 1 year ago

Hey there!

In R's tidyverse, the unnest_wider() function provides a convenient way to spread the contents of a column that contains arrays or lists of values across multiple new columns. Let's consider a DataFrame named test and see how we'd like the result to appear:

test = DataFrame(a = [1,2], b = [["c","d"],["e", "f"]])

# result
result = DataFrame(a = [1,2], b_1 = ["c" , "e"], b_2 = ["d", "f"])

To achieve that with R's tidyverse, we would write:

> library(tidyverse)

> data <- tibble(
  a = c(1,2),
  b = list(c("c","d"), c("e", "f"))
)

> data_wide <- data %>% 
  unnest_wider(b, names_sep = "_")

> data_wide
# A tibble: 2 × 3
      a b_1   b_2  
  <dbl> <chr> <chr>
1     1 c     d    
2     2 e     f    

To achieve a similar result in Julia using the DataFrames.jl package, the process is straightforward, albeit with a distinct Julia-idiomatic flavor. First, we'd define a function, split_uniformly(), to handle the transformation. Then, we'd use this function within a @chain pipeline (from Chain.jl) together with the DataFrames minilanguage. Note that the new columns come out as b1 and b2 here, rather than b_1 and b_2:


julia> using DataFrames, Chain

julia> test = DataFrame(a = [1,2], b = [["c","d"],["e", "f"]])

julia> function split_uniformly(v)
    n = length(first(v))
    [NamedTuple(Symbol.("b", 1:n) .=> Tuple(amem))
     for amem in v]
end

julia> test_wide = @chain test begin
    transform(:b => split_uniformly => AsTable)
    select(Not(:b))
end
2×3 DataFrame
 Row │ a      b1      b2     
     │ Int64  String  String 
─────┼───────────────────────
   1 │     1  c       d
   2 │     2  e       f
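
The helper above hard-codes the "b" prefix and produces b1/b2 rather than the b_1/b_2 shown in the desired result. A minimal sketch of a more general version follows, assuming a keyword-based signature that I made up for illustration (prefix and sep are hypothetical names, not part of DataFrames.jl or TidierData.jl):

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl or TidierData.jl):
# turn each equal-length vector in `v` into a NamedTuple whose keys are
# built from `prefix`, `sep`, and the element position, e.g. :b_1, :b_2.
function split_uniformly(v; prefix, sep = "_")
    n = length(first(v))
    # Fail early instead of silently misaligning columns.
    all(x -> length(x) == n, v) ||
        throw(ArgumentError("all elements must have length $n"))
    colnames = Tuple(Symbol(prefix, sep, i) for i in 1:n)
    return [NamedTuple{colnames}(Tuple(x)) for x in v]
end

test = DataFrame(a = [1, 2], b = [["c", "d"], ["e", "f"]])

# Expand :b into new columns via AsTable, then drop the original column.
test_wide = select(
    transform(test, :b => (v -> split_uniformly(v; prefix = "b")) => AsTable),
    Not(:b),
)
```

With the default separator, the new columns are named b_1 and b_2, which mirrors names_sep = "_" in the R call. The length check is a design choice on my part: ragged inputs raise an error here, whereas a real unnest_wider() would more likely pad with missing values.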

kdpsingh commented 1 year ago

Love the suggestion!

Also, just to check, flatten() is the equivalent of unnest_longer(), right?

https://dataframes.juliadata.org/stable/lib/functions/#DataFrames.flatten


julia> df1 = DataFrame(a=[1, 2], b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])
2×3 DataFrame
 Row │ a      b       c
     │ Int64  Array…  Array…
─────┼───────────────────────
   1 │     1  [1, 2]  [5, 6]
   2 │     2  [3, 4]  [7, 8]

julia> flatten(df1, :b)
4×3 DataFrame
 Row │ a      b      c
     │ Int64  Int64  Array…
─────┼──────────────────────
   1 │     1      1  [5, 6]
   2 │     1      2  [5, 6]
   3 │     2      3  [7, 8]
   4 │     2      4  [7, 8]
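
For comparison with unnest_longer() on several columns, here is a small sketch of flatten() applied to both list columns at once. It reuses the df1 example above; the variable names long_b and long_bc are only illustrative:

```julia
using DataFrames

df1 = DataFrame(a = [1, 2], b = [[1, 2], [3, 4]], c = [[5, 6], [7, 8]])

# Flatten one list column: each element of :b becomes its own row,
# and the values in the other columns are repeated.
long_b = flatten(df1, :b)

# Flatten several list columns in lockstep; within each row the
# vectors in :b and :c must have the same length.
long_bc = flatten(df1, [:b, :c])
```

Flattening :b and :c together keeps their elements paired row by row, which is roughly what passing several columns to unnest_longer() does in tidyr.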

atantos commented 1 year ago

Exactly! flatten() is the equivalent. However, I guess there are some decisions to be made here: either keep flatten() and add a new function with a similar name (without the unnest prefix) that corresponds to R's unnest_wider(), or, for consistency with the tidyverse, reshape flatten() a bit, name it unnest_longer(), and also create an unnest_wider().

kdpsingh commented 9 months ago

@nest(), @unnest_longer(), and @unnest_wider() are now officially supported in v0.14.4.
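
As a rough usage sketch for the example from the top of this issue, assuming the macros take the data frame first, then the column, and that @unnest_wider() accepts a names_sep keyword like its tidyr counterpart (this is my reading of the TidierData documentation, so please check it against your installed version):

```julia
using DataFrames
using TidierData

test = DataFrame(a = [1, 2], b = [["c", "d"], ["e", "f"]])

# Spread the elements of :b across new columns.
# Assumption: names_sep = "_" yields column names of the form b_1, b_2.
test_wide = @unnest_wider(test, b, names_sep = "_")

# Give each element of :b its own row instead
# (the counterpart of flatten(test, :b) in DataFrames.jl).
test_long = @unnest_longer(test, b)
```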