TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License

Adds @unnest_wider, @unnest_longer, and @nest #77

Closed: drizk1 closed 9 months ago

drizk1 commented 10 months ago

This pull request got a little bigger than I initially anticipated, but the four added macros all support tidy selection and interpolation.

The two unnest macros (#34) support grouped dataframes (ungroup -> regroup).

After the unnests, I thought I would try nest. I was struggling with some syntax issues keeping the @nest(df, by = , key = ) method in the same macro as the @nest(df, nested_col = cols) method, so I ended up splitting them for the sake of simplicity.

Before going further and writing brief documentation for the nests, I thought I would check in. Should I drop the nests from this PR for now, while I try to sort out grouping and continue trying to reduce them back into just 1 macro?

I also added tidy selection to @unite.

kdpsingh commented 10 months ago

Thanks so much, @drizk1! Let me take a look at the @nest() syntax and see if I can make it line up with tidyr. Don't remove it -- leave it in for now. I just want to see if I can figure out how to make it match.

This is exciting!

drizk1 commented 10 months ago

Sounds good! For context, originally, after separating them into two underlying functions, I thought I could use multiple dispatch with 2 @nest macros but that was throwing errors so I split the macro in 2.

For @nest I tried pairs before settling on the method above. Using pairs made tidy selection tricky.

For @nest_by, I also couldn't figure out how to leverage keyword arguments with tidy selection so that the argument names would be visible, but perhaps there is a way.

I'm also happy to go back to the drawing board and see if I can make one function that performs the different nests to make the macro syntax match more easily.

kdpsingh commented 10 months ago

Ah, this may be because while functions can perform dispatch based on types, macros can only perform dispatch based on the number of arguments. This is because macros see all arguments as expressions, so they are all the same type.
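A minimal sketch of the point above, using a hypothetical macro named `@describe_args` that is not part of TidierData. Arguments reach a macro as syntax (a `Symbol`, an `Expr`, or a literal) rather than as run-time values, so a macro cannot choose a method from the type a variable will eventually hold. Methods with different argument counts, however, are easy to tell apart.

```julia
# Hypothetical macro for illustration only; it is not part of TidierData.
# Each method returns a string describing what the macro itself received.
macro describe_args(df, a)
    return "2 arguments: $(typeof(df)), $(typeof(a))"
end

macro describe_args(df, a, b)
    return "3 arguments: $(typeof(df)), $(typeof(a)), $(typeof(b))"
end

x = [1, 2, 3]   # a Vector{Int} at run time
y = "text"      # a String at run time

# The macro sees the names x and y, not the values bound to them.
println(@describe_args(x, y))

# A keyword-style argument such as `key = x` arrives as a single Expr.
println(@describe_args(x, y, key = x))
```

Under these assumptions, `@nest(df, by = ..., key = ...)` and `@nest(df, nested_col = cols)` would look alike to the macro whenever they carry the same number of arguments, which may be why splitting them into two macros was the simpler route.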

kdpsingh commented 10 months ago

Looking at the tidyr nest_by() documentation, it looks like this function returns a rowwise data frame. I have mostly figured out how to implement rowwise data frames but I would wait to add @nest_by() until rowwise is properly implemented. We don't have to remove the codebase - I may just comment it out for now and focus on @nest().

drizk1 commented 10 months ago

Re the macros: ok this is great to know for the future.

Thank you for clarifying the rowwise aspect. I spent some time reading the cheat sheet and documentation again this morning, and it lines up in my mind more now.

Focusing on nest and commenting out nest_by until tidier has the rowwise dataframe ability sounds good to me. Let me know if there's anything I can do to help.

kdpsingh commented 10 months ago

At this point, the main thing I'd like to see is support for grouped data frames in @nest(). It doesn't need to support by = yet because none of the TidierData macros support that yet.

If you pass it a grouped data frame, it should separately nest the selected columns into a data frame for each group, with one nested data frame per group.

drizk1 commented 10 months ago

Sweet. I was tinkering with that last night actually. I think I know how to make it happen now, so I'll try to make it official in the next day or two.

drizk1 commented 10 months ago

Alright I sorted out the grouping. While doing so, I realized tidyr nests into tibbles, which might be more similar to nesting into dataframes, than to the arrays I was nesting into.

Nesting into dataframes was just a few lines to change, so switching it is no problem.

The question I now have is which one would you prefer it nests into? My understanding is that arrays may be less memory intensive but also less flexible than a dataframe?

We could theoretically offer an argument so the user can choose?

I'm open to anything, but fully defer the decision to you, and I will implement it.

kdpsingh commented 10 months ago

Ooh thanks for catching this. I think we should nest into DataFrames, which is important if you are nesting multiple columns. Let's not implement an option for alternatives for now. Just make sure that the unnesting works correctly if we nest into DataFrames. Once you do that, I'll review and merge. Exciting!

drizk1 commented 10 months ago

Alright, so now, unnesting supports dataframes.

And @nest nests into dataframes.

I tested it against the following tidyr example, which has dataframes of different lengths, and achieved identical results for the following four examples, so it should be ready to go.

df = DataFrame(
    x = 1:3,
    y = Any[
        DataFrame(),
        DataFrame(a = [1], b = [2]),
        DataFrame(a = 1:3, b = [3, 2, 1], c = [4, 4, 4])
    ]
)

@chain df begin 
  @unnest_wider(y)
  @unnest_longer(a:c, keep_empty = true)
end

@chain df begin 
  @unnest_wider(y)
  @unnest_longer(a:c, keep_empty = false)
end

@chain df begin 
  @unnest_longer(y, keep_empty = true)
  @unnest_wider(y)
end

@chain df begin 
  @unnest_longer(y, keep_empty = false)
  @unnest_wider(y)
end

kdpsingh commented 10 months ago

This looks amazing! I will review and merge soon.

kdpsingh commented 10 months ago

Great work thus far. Discovered one issue.

@nest() doesn't quite match up with the nest() behavior in R tidyr.

For example, in R, nesting multiple columns produces this:

> df = tibble(a = rep(letters[1:5], each = 3), b = 1:15, c = 16:30)
> df |> nest(data = b:c)
# A tibble: 5 × 2
  a     data            
  <chr> <list>          
1 a     <tibble [3 × 2]>
2 b     <tibble [3 × 2]>
3 c     <tibble [3 × 2]>
4 d     <tibble [3 × 2]>
5 e     <tibble [3 × 2]>

And in this PR, nesting multiple columns produces this:

julia> df = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
julia> @chain df @nest(data = b:c)
15×2 DataFrame
 Row │ a     data          
     │ Char  DataFrame     
─────┼─────────────────────
   1 │ a     1×2 DataFrame 
   2 │ a     1×2 DataFrame 
   3 │ a     1×2 DataFrame 
   4 │ b     1×2 DataFrame 
   5 │ b     1×2 DataFrame 
   6 │ b     1×2 DataFrame 
   7 │ c     1×2 DataFrame 
   8 │ c     1×2 DataFrame 
   9 │ c     1×2 DataFrame 
  10 │ d     1×2 DataFrame 
  11 │ d     1×2 DataFrame 
  12 │ d     1×2 DataFrame 
  13 │ e     1×2 DataFrame 
  14 │ e     1×2 DataFrame 
  15 │ e     1×2 DataFrame 

Any thoughts on how to fix?

drizk1 commented 10 months ago

Oh wow. Great catch. It is almost as if it groups it based on the remaining columns and then nests. I think using setdiff to groupby the outer dataframe columns, and then converting the grouped dataframes to dataframes and nesting them, might work. I will play around with it.

Edit: it works for 1 nest, now trying to sort out when it's nesting multiple.
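The setdiff-and-groupby idea described above can be sketched with plain DataFrames.jl. This is only an illustration under stated assumptions, not the code in this PR: the variable names `nest_cols`, `outer_cols`, and `nested` are made up for the example, and the output column name `data` is taken from the example earlier in the thread.

```julia
using DataFrames

df = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)

nest_cols = [:b, :c]                                 # columns to nest
outer_cols = setdiff(propertynames(df), nest_cols)   # remaining columns become group keys

# sort = false keeps the groups in order of first appearance.
gdf = groupby(df, outer_cols; sort = false)

# One row per group: the key values, plus a DataFrame holding the nested columns.
nested = DataFrame([k => [g[1, k] for g in gdf] for k in outer_cols])
nested.data = [g[:, nest_cols] for g in gdf]

println(size(nested))
```

Because the rows are grouped by every column that is not being nested, the result has one row per distinct combination of the outer columns instead of one row per input row.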

kdpsingh commented 10 months ago

Hmm this gives me an idea. It might be possible to implement @nest() in just a couple lines of code.

kdpsingh commented 10 months ago

Actually, go ahead with modifying what you have now and see if you can get it working.

There's one parsing functionality I'd need to add still to get this working with less code.

So I'll revisit this if you can't get it working the way you have it.

drizk1 commented 10 months ago

Alright, @nest is now correctly determining the number of rows based on the groups, and it supports multiple columns to group by for the outer df. From the example above:

5×2 DataFrame
 Row │ a     data          
     │ Char  DataFrame     
─────┼─────────────────────
   1 │ a     3×2 DataFrame 
   2 │ b     3×2 DataFrame 
   3 │ c     3×2 DataFrame 
   4 │ d     3×2 DataFrame 
   5 │ e     3×2 DataFrame 

## this matches R as well
df = DataFrame(x = [1, 1, 1, 2, 2, 3], y = 1:6, z = 13:18, a = 7:12, ab = 12:-1:7);
@nest(df, n2 = starts_with("a"), n3 = (y:z))
3×3 DataFrame
 Row │ x      n3             n2            
     │ Int64  DataFrame      DataFrame     
─────┼─────────────────────────────────────
   1 │     1  3×2 DataFrame  1×2 DataFrame 
   2 │     2  2×2 DataFrame  1×2 DataFrame 
   3 │     3  1×2 DataFrame  1×2 DataFrame 

I will note, though, that when trying to unnest the multiple nested columns that I nest in the second example above, I am getting slightly different dimensions than with R. I suspect this might have to do with the slightly different behavior of unnest_wider (illustrated below: in Julia it won't add new rows, but in R it will)? Of note, when using only unnest_longer and unnest_wider in R, for the 6x5 df above, it does not return to a 6x5. It only does so if unnesting with unnest(), which I have not yet tried building.

Depending on what you think, I may have to go back and rework unnest_longer and unnest_wider given the example below.

in R to go back to original df

df4 <- data.frame(
  x = c("a", "b", "a", "b", "C", "a"),
  y = c("e", "e", "e", "f", "f", "3"),
  yz = 13:18,
  a = 7:12,
  ab = 12:7
)
test4 = nest(df4, n2 = a:ab)

test4 %>% 
  unnest_wider(n2)

in julia to go back to original df

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = [13, 14, 15, 16, 17, 13], a = 7:12, ab = 12:-1:7)
test4 =  @nest(df4, n2 = a:ab)
@chain test4 begin
  @unnest_wider(n2)
  @unnest_longer(a:ab)
end

kdpsingh commented 10 months ago

Thanks for the update. I'll take a look and see if I can figure out why it's behaving differently.

While I am eager to merge, I want to make sure things behave similarly across the implementations, especially for the use case where we nest and then unnest.

drizk1 commented 10 months ago

I totally agree.

please ignore the two commits below, and frankly most of my more recent comment above. they were my mind playing tricks on me.

#same results as in R 
df44 = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
dfdf = @chain df44 @nest(data = b:c)
@chain dfdf @unnest_wider(data) @unnest_longer(b:c)

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = [13, 14, 15, 16, 17, 13], a = 7:12, ab = 12:-1:7)
test4 = @nest(df4, n2 = yz:ab)
@chain test4 @unnest_wider(n2) @unnest_longer(yz:ab)

Unnesting multiple columns of nests back to the original dataframes is the last frontier, I think. I still get multiple dimensions for that.

Edit: I finally figured out where the bug is. The bug is not in either of the unnests, but in the nest when nesting multiple sets of columns at once. The example below illustrates how some of the cells are not properly populated, as it yields differences from R.

test2 = @nest(df4, n2 = a:ab, n3 = y:yz)
@chain test2 @unnest_wider(n2:n3)

drizk1 commented 10 months ago

Sorry to have taken you on a journey of excess commits over the last week. I have a deep appreciation for un/nesting now.

Last night, I realized that @nest was nesting multiple sets sequentially not in parallel. This was causing dimension mismatches leading to the issues returning to the original df when unnesting.

This is now fixed and it behaves the same as in R, so now the behavior for @unnest_wider, @unnest_longer, and @nest maps to tidyr.
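One way to picture the difference between sequential and parallel nesting, sketched with plain DataFrames.jl. This is an assumption-laden illustration rather than the PR's implementation: `nest_sets`, `all_nested`, `outer_cols`, and `nested` are hypothetical names. The key point is that the group keys are computed once, after excluding every column that any set nests, and each set is then cut from the same groups.

```julia
using DataFrames

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"],
                y = ["e", "e", "e", "f", "f", "e"],
                yz = 13:18, a = 7:12, ab = 12:-1:7)

# Hypothetical spec: new column name => columns to nest under it.
nest_sets = [:n2 => [:a, :ab], :n3 => [:y, :yz]]

# Compute the group keys once, excluding every column that any set nests.
all_nested = reduce(vcat, last.(nest_sets))
outer_cols = setdiff(propertynames(df4), all_nested)
gdf = groupby(df4, outer_cols; sort = false)

nested = DataFrame([k => [g[1, k] for g in gdf] for k in outer_cols])

# Every set is taken from the same groups, so each nested column
# has exactly one entry per group and the row counts line up.
for (name, cols) in nest_sets
    nested[!, name] = [g[:, cols] for g in gdf]
end

println(size(nested))
```

Nesting one set and then nesting the next from that intermediate result would instead regroup on a table that already contains a nested column, which is the kind of mismatch described above.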

This returns to the original dataframe

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"],  yz = 13:18, a = 7:12, ab = 12:-1:7)
nested_df = @nest(df4, n2 = starts_with("a"), n3 = y:yz)

@chain nested_df begin
  @unnest_wider(n3:n2)
  @unnest_longer(y:ab)
end

just like in R

df4 <- data.frame(
  x = c("a", "b", "a", "b", "C", "a"),
  y = c("e", "e", "e", "f", "f", "e"),
  yz = 13:18,
  a = 7:12,
  ab = 12:7)
nested_df = nest(df4, n2 = a:ab, n3 = y:yz)
nested_df %>% unnest_wider(n2:n3) %>% unnest_longer(a:yz)

I checked the intermediate state after unnesting wider and they match each other as well.

I think it is finally ready from my standpoint. Again, sorry for the whirlwind of preemptive commits and thank you for helping me figure out some of the bugs.

kdpsingh commented 10 months ago

Awesome! Will look at this soon. Super excited to see this.