Closed mistermichaelll closed 1 week ago
Thanks for catching this. I haven't looked at the slices in a while.
I think the issue might be due to this block of code, where n is not being captured. Although I'm surprised because the doc strings work.
for expr in exprs
if @capture(expr, lhs_ = rhs_)
expr_dict[lhs] = rhs
if lhs == :missing_rm
missing_rm = rhs
elseif lhs == :prop
arranged = true
end
else
column = expr
end
end
if haskey(expr_dict, :with_ties)
with_ties = expr_dict[:with_ties]
end
I'll play around with it and get back to you shortly
Curious, are the doc strings successfully building the only validation for the output of most macros?
I'm wondering if it's worth adding some additional tests beyond just the pivoting functions, potentially even just replicating some from the dplyr source code to catch some of the less straightforward edgecases/make sure the code is robust.
I'd be happy to help with that if that's something that'd be beneficial!
I will defer to @kdpsingh on the additional test as he's the Tidier Maven.
as far as the slice_max
bug, I have fixed it. I just need to make the corresponding changes for slice_min
julia> @slice_max(df, bike_commuters, n = 7)
7×3 DataFrame
Row │ city total_commuters bike_commuters
│ String Int64 Int64
─────┼────────────────────────────────────────────────────────────────────
1 │ New York city, New York 4007171 48601
2 │ Chicago city, Illinois 1318960 48601
3 │ Portland city, Oregon 347260 21982
4 │ Los Angeles city, California 1938625 20495
5 │ San Francisco city, California 500469 19429
6 │ Washington city, District of Col… 362204 16647
7 │ Seattle city, Washington 419233 14801
julia> df = DataFrame(
a = ["a", "b","a", "b", "a","b" ,"a", "b"],
b = [0.3, 2, missing, 3, 6, 5, 7, 7],
c = [0.2, 0.2, 0.2, missing, 1, missing, 5, 6]);
julia> @chain df begin
@group_by a
@slice_max b n = 3
end
GroupedDataFrame with 2 groups based on key: a
First Group (3 rows): a = "a"
Row │ a b c
│ String Float64? Float64?
─────┼────────────────────────────
1 │ a 7.0 5.0
2 │ a 6.0 1.0
3 │ a 0.3 0.2
â‹®
Last Group (3 rows): a = "b"
Row │ a b c
│ String Float64? Float64?
─────┼─────────────────────────────
1 │ b 7.0 6.0
2 │ b 5.0 missing
3 │ b 3.0 missing
julia> @chain df begin
@group_by a
@slice_max b
end
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = "a"
Row │ a b c
│ String Float64? Float64?
─────┼────────────────────────────
1 │ a 7.0 5.0
â‹®
Last Group (1 row): a = "b"
Row │ a b c
│ String Float64? Float64?
─────┼────────────────────────────
1 │ b 7.0 6.0
Thanks @mistermichaelll for reporting and @drizk1 for the fix! I will review and merge.
No problem @kdpsingh !
I was wondering if you saw my suggestion about adding additional unit tests that reflect the ones in the original dplyr
package? I think that making the test suite more robust could go a long way towards building trust in the Tidier
ecosystem as it starts seeing more adoption.
Like I said, I'm happy to take that on but just wanted to see if that's something you think would be beneficial!
We would love to have additional tests and would really appreciate your help with this.
In general, my preference has been to build all/most tests as doctests (in the docstrings) so that users can also get a clear view into should and shouldn't work.
But I'm open to additional tests in the tests/
folder depending on what you have in mind.
Feel free to open an issue and/or a PR!
Sorry it took so long. This particular issue is now fixed in #110.
Here's some cycling-flavored data for a reprex 😄 I'm on Julia v.1.10.4.
Let's say that I want to pull the top 10 cities with the most bike commuters. I'd expect that I could use the
@slice_max
macro and do something simple like this:However, this returns:
Returning a single row in spite of the
n
value is expected behavior ifwith_ties = false
according to the docs, but the default should bewith_ties = true
and there are no obvious ties in this data. So I would actually expect the result to be this:If I pop over to R with the same dataset, this is the result I get from
df |> slice_max(bike_commuters, n = 10)
, even if I shuffle the dataset around such that thebike_commuters
column is not already ordered descending. I would expect@slice_max
in Julia to behave the same way in this case.If I try other values of
n
with the@slice_max
macro, I still get back the 1x3 DataFrame.