Closed drizk1 closed 9 months ago
Thanks so much, @drizk1! Let me take a look at the @nest() syntax and see if I can make it line up with tidyr. Don't remove it -- leave it in for now. I just want to see if I can figure out how to make it match.
This is exciting!
Sounds good! For context, originally, after separating them into two underlying functions, I thought I could use multiple dispatch with two @nest macros, but that was throwing errors, so I split the macro in two.
For @nest, I tried pairs before settling on the method above. Using pairs made tidy selection tricky.
For @nest_by, I also couldn't figure out how to leverage keyword arguments with tidy selection so that the argument names would be visible, but perhaps there is a way.
I'm also happy to go back to the drawing board and see if I can make one function that performs the different nests to make the macro syntax match more easily.
Ah, this may be because while functions can perform dispatch based on types, macros can only perform dispatch based on the number of arguments. This is because macros see all arguments as expressions, so they are all the same type.
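To make the point about macro arguments concrete, here is a minimal sketch in plain Julia. The macro names `@nargs` and `@argtype` are hypothetical and exist only for this illustration; they are not part of TidierData. It shows that a macro can have separate methods for different argument counts, and that column expressions like the ones passed to @nest reach the macro as unevaluated `Expr` objects rather than as runtime values.

```julia
# Hypothetical macros, for illustration only (not part of TidierData).

# Two methods of the same macro, selected by the number of arguments.
macro nargs(a)
    return "one argument"
end

macro nargs(a, b)
    return "two arguments"
end

# Report the type of the syntax object the macro receives.
macro argtype(ex)
    return string(typeof(ex))
end

# `df`, `b`, and `c` do not need to exist: macros only see the syntax.
one = @nargs(df)                          # "one argument"
two = @nargs(df, data = b:c)              # "two arguments"

range_type = @argtype(b:c)                # "Expr"
call_type  = @argtype(starts_with("a"))   # "Expr"
```

Because both `b:c` and `starts_with("a")` arrive as `Expr`, two single-argument methods cannot tell them apart by type, which is consistent with the errors described above.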
Looking at the tidyr nest_by() documentation, it looks like this function returns a rowwise data frame. I have mostly figured out how to implement rowwise data frames but I would wait to add @nest_by() until rowwise is properly implemented. We don't have to remove the codebase - I may just comment it out for now and focus on @nest().
Re the macros: ok this is great to know for the future.
Thank you for clarifying the rowwise aspect. I spent some time reading the cheat sheet and documentation again this morning, and it lines up in my mind more now.
Focusing on nest and commenting out nest_by until tidier has the rowwise dataframe ability sounds good to me. Let me know if there's anything I can do to help.
At this point, the main thing I'd like to see is support for grouped data frames in @nest(). It doesn't need to support by = yet because none of the TidierData macros support that yet.
If you pass it a grouped data frame, it should separately nest the selected columns into a data frame for each group, with one nested data frame per group.
Sweet. I was tinkering with that last night actually. I think I know how to make it happen now so I'll try to make it official in the next day or two
Alright, I sorted out the grouping. While doing so, I realized tidyr nests into tibbles, which might be more similar to nesting into dataframes than to the arrays I was nesting into.
Nesting into dataframes was just a few lines to change, so switching it is no problem.
The question I now have is: which one would you prefer it nests into? My understanding is that arrays may be less memory intensive but also less flexible than a dataframe.
We could theoretically offer an argument so the user can choose?
I'm open to anything, but fully defer the decision to you, and I will implement it.
Ooh thanks for catching this. I think we should nest into DataFrames, which is important if you are nesting multiple columns. Let's not implement an option for alternatives for now. Just make sure that the unnesting works correctly if we nest into DataFrames. Once you do that, I'll review and merge. Exciting!
Alright, so now, unnesting supports dataframes.
@unnest_wider can unnest arrays, tuples, dataframes and dicts (adding tuple support was the only way to make the examples below possible).
@unnest_longer can unnest arrays and dataframes.
And @nest nests into dataframes.
I tested it against the following tidyr example, which has dataframes of different lengths, and got identical results for all four cases below, so it should be ready to go.
df = DataFrame(
    x = 1:3,
    y = Any[
        DataFrame(),
        DataFrame(a = [1], b = [2]),
        DataFrame(a = 1:3, b = [3, 2, 1], c = [4, 4, 4])
    ]
)
@chain df begin
    @unnest_wider(y)
    @unnest_longer(a:c, keep_empty = true)
end
@chain df begin
    @unnest_wider(y)
    @unnest_longer(a:c, keep_empty = false)
end
@chain df begin
    @unnest_longer(y, keep_empty = true)
    @unnest_wider(y)
end
@chain df begin
    @unnest_longer(y, keep_empty = false)
    @unnest_wider(y)
end
This looks amazing! I will review and merge soon.
Great work thus far. Discovered one issue.
@nest() doesn't quite match up with the nest() behavior in R tidyr.
For example, in R, nesting multiple columns produces this:
> df = tibble(a = rep(letters[1:5], each = 3), b = 1:15, c = 16:30)
> df |> nest(data = b:c)
# A tibble: 5 × 2
a data
<chr> <list>
1 a <tibble [3 × 2]>
2 b <tibble [3 × 2]>
3 c <tibble [3 × 2]>
4 d <tibble [3 × 2]>
5 e <tibble [3 × 2]>
And in this PR, nesting multiple columns produces this:
julia> df = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
julia> @chain df @nest(data = b:c)
15×2 DataFrame
Row │ a data
│ Char DataFrame
─────┼─────────────────────
1 │ a 1×2 DataFrame
2 │ a 1×2 DataFrame
3 │ a 1×2 DataFrame
4 │ b 1×2 DataFrame
5 │ b 1×2 DataFrame
6 │ b 1×2 DataFrame
7 │ c 1×2 DataFrame
8 │ c 1×2 DataFrame
9 │ c 1×2 DataFrame
10 │ d 1×2 DataFrame
11 │ d 1×2 DataFrame
12 │ d 1×2 DataFrame
13 │ e 1×2 DataFrame
14 │ e 1×2 DataFrame
15 │ e 1×2 DataFrame
Any thoughts on how to fix?
Oh wow. Great catch. It is almost as if it groups it based on the remaining columns and then nests. I think using setdiff to groupby the outer dataframe columns, and then converting the grouped dataframes to dataframes and nesting them might work. I will play around with it.
edit: it works for 1 nest, now trying to sort out when it's nesting multiple
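The setdiff-then-groupby idea can be sketched with plain DataFrames.jl as below. This is only an illustration of the approach under that assumption, not the code in this PR; the variable names `inner`, `outer`, and `nested` are made up for the example.

```julia
using DataFrames

df = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)

# Columns to nest, and the remaining "outer" columns found with setdiff.
inner = [:b, :c]
outer = setdiff(propertynames(df), inner)   # [:a]

# Group by the outer columns, then store one DataFrame per group.
gdf = groupby(df, outer)
nested = DataFrame(
    a = [first(sdf.a) for sdf in gdf],        # one key value per group
    data = [sdf[:, inner] for sdf in gdf],    # copy of the group's inner columns
)

size(nested)           # (5, 2)
size(nested.data[1])   # (3, 2)
```

This gives one row per group with a 3×2 nested data frame in each cell, the same shape as the tidyr output quoted above.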
Hmm this gives me an idea. It might be possible to implement @nest() in just a couple lines of code.
Actually, go ahead with modifying what you have now and see if you can get it working.
There's one parsing functionality I'd need to add still to get this working with less code.
So I'll revisit this if you can't get it working the way you have it.
Alright, @nest is now correctly determining the number of rows based on the groups and it supports multiple columns to group by for the outer df.
From the example above:
5×2 DataFrame
Row │ a data
│ Char DataFrame
─────┼─────────────────────
1 │ a 3×2 DataFrame
2 │ b 3×2 DataFrame
3 │ c 3×2 DataFrame
4 │ d 3×2 DataFrame
5 │ e 3×2 DataFrame
## this matches R as well
df = DataFrame(x = [1, 1, 1, 2, 2, 3], y = 1:6, z = 13:18, a = 7:12, ab = 12:-1:7);
@nest(df, n2 = starts_with("a"), n3 = (y:z))
3×3 DataFrame
Row │ x n3 n2
│ Int64 DataFrame DataFrame
─────┼─────────────────────────────────────
1 │ 1 3×2 DataFrame 1×2 DataFrame
2 │ 2 2×2 DataFrame 1×2 DataFrame
3 │ 3 1×2 DataFrame 1×2 DataFrame
I will note, though, that when trying to unnest the multiple nested columns from the second example above, I am getting slightly different dimensions than with R. I suspect this might have to do with the slightly different behavior of unnest_wider (illustrated below: in Julia it won't add new rows, but in R it will). Of note, when using only unnest_longer and unnest_wider in R on the 6x5 df above, it does not return to a 6x5. It only does so when unnesting with unnest(), which I have not yet tried building.
Depending on what you think, I may have to go back and rework unnest_longer and unnest_wider given the example below.
In R, to go back to the original df:
df4 <- data.frame(
x = c("a", "b", "a", "b", "C", "a"),
y = c("e", "e", "e", "f", "f", "3"),
yz = 13:18,
a = 7:12,
ab = 12:7
)
test4 = nest(df4, n2 = a:ab)
test4 %>%
unnest_wider(n2)
In Julia, to go back to the original df:
df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = [13, 14, 15, 16, 17, 13], a = 7:12, ab = 12:-1:7)
test4 = @nest(df4, n2 = a:ab)
@chain test4 begin
    @unnest_wider(n2)
    @unnest_longer(a:ab)
end
Thanks for the update. I'll take a look and see if I can figure out why it's behaving differently.
While I am eager to merge, I want to make sure things behave similarly across the implementations, especially for the use case where we nest and then unnest.
I totally agree.
please ignore the two commits below, and frankly most of my more recent comment above. they were my mind playing tricks on me.
#same results as in R
df44 = DataFrame(a = repeat('a':'e', inner = 3), b = 1:15, c = 16:30)
dfdf = @chain df44 @nest(data = b:c)
@chain dfdf @unnest_wider(data) @unnest_longer(b:c)
df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = [13, 14, 15, 16, 17, 13], a = 7:12, ab = 12:-1:7)
test4 = @nest(df4, n2 = yz:ab)
@chain test4 @unnest_wider(n2) @unnest_longer(yz:ab)
Unnesting multiple columns of nests back to the original dataframes is the last frontier, I think. I still get different dimensions for that.
Edit: I finally figured out where the bug is. The bug is not in either of the unnests, but in the nest when nesting multiple sets of columns at once. The example below illustrates how some of the cells are not properly populated, as it yields differences from R.
test2 = @nest(df4, n2 = a:ab, n3 = y:yz)
@chain test2 @unnest_wider(n2:n3)
Sorry to have taken you on a journey of excess commits over the last week. I have a deep appreciation for un/nesting now.
Last night, I realized that @nest was nesting multiple sets sequentially, not in parallel. This was causing dimension mismatches, leading to the issues returning to the original df when unnesting.
This is now fixed and it behaves the same as in R, so now the behavior of @unnest_wider, @unnest_longer, and @nest maps to tidyr.
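For readers following along, nesting several column sets "in parallel" can be sketched with plain DataFrames.jl as follows. The helper `nest_sets` is hypothetical and is not the implementation in this PR. The key point it illustrates is that the grouping is computed once, from the columns that belong to no nested set, and every nested column is then built from that same grouping, so all nested cells in a row cover the same rows of the original data.

```julia
using DataFrames

# Hypothetical helper, for illustration only (not the PR's implementation).
function nest_sets(df::AbstractDataFrame, sets::Pair{Symbol,Vector{Symbol}}...)
    # Outer columns are those that appear in none of the nested sets.
    nested_cols = reduce(vcat, [last(s) for s in sets])
    outer = setdiff(propertynames(df), nested_cols)

    # Group ONCE; every nested column reuses this grouping.
    gdf = groupby(df, outer)

    out = DataFrame()
    for col in outer
        out[!, col] = [first(sdf[!, col]) for sdf in gdf]
    end
    for (name, cols) in sets
        out[!, name] = [sdf[:, cols] for sdf in gdf]
    end
    return out
end

df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"],
                y = ["e", "e", "e", "f", "f", "e"],
                yz = 13:18, a = 7:12, ab = 12:-1:7)

nested = nest_sets(df4, :n2 => [:a, :ab], :n3 => [:y, :yz])
```

Here `nested` has one row per distinct value of `x`, and in each row the `n2` and `n3` cells have the same number of rows, which is the property the sequential version was violating.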
This returns to the original dataframe
df4 = DataFrame(x = ["a", "b", "a", "b", "C", "a"], y = ["e", "e", "e", "f", "f", "e"], yz = 13:18, a = 7:12, ab = 12:-1:7)
nested_df = @nest(df4, n2 = starts_with("a"), n3 = y:yz)
@chain nested_df begin
    @unnest_wider(n3:n2)
    @unnest_longer(y:ab)
end
just like in R
df4 <- data.frame(
x = c("a", "b", "a", "b", "C", "a"),
y = c("e", "e", "e", "f", "f", "e"),
yz = 13:18,
a = 7:12,
ab = 12:7)
nested_df = nest(df4, n2 = a:ab, n3 = y:yz)
nested_df %>% unnest_wider(n2:n3) %>% unnest_longer(a:yz)
I checked the intermediate state after unnesting wider and they match each other as well.
I think it is finally ready from my standpoint. Again, sorry for the whirlwind of preemptive commits and thank you for helping me figure out some of the bugs.
Awesome! Will look at this soon. Super excited to see this.
This pull request got a little bigger than I initially anticipated, but the four added macros all support tidy selection and interpolation.
These two #34 support grouped dataframes (ungroup -> regroup) indicies_include, and keep_empty
After the unnests, I thought I would try nest. I was struggling with some syntax issues keeping @nest(df, by = , key = ) in the same macro with the @nest(df, nested_col = cols) method, so I ended up splitting them for the sake of simplicity. @nest_by looks slightly different than the tidyr version in that the by and key are not explicitly written, but supported. by argument above, but similar (looks like maybe each group becomes its own df/array?).
Before going further and writing brief documentation for the nests, I thought I would check in. Should I drop the nests from this PR for now, while I try to sort out grouping and while I continue to try reducing them back into just 1 macro?
I also added tidy selection to @unite.