TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License
83 stars 7 forks source link

@slice_max only returning one row, despite no ties #108

Closed mistermichaelll closed 1 week ago

mistermichaelll commented 1 month ago

Here's some cycling-flavored data for a reprex 😄 I'm on Julia v.1.10.4.

using DataFrames
using Tidier
#> (@v1.10) pkg> status DataFrames, Tidier
#> Status `~/.julia/environments/v1.10/Project.toml`
#>   [a93c6f00] DataFrames v1.6.1
#>   [f0413319] Tidier v1.4.0

df = DataFrame(
    :city => [
        "New York city, New York",
        "Chicago city, Illinois",
        "Portland city, Oregon",
        "Los Angeles city, California",
        "San Francisco city, California",
        "Washington city, District of Columbia",
        "Seattle city, Washington",
        "Philadelphia city, Pennsylvania",
        "Boston city, Massachusetts",
        "Minneapolis city, Minnesota",
        "Austin city, Texas",
        "Denver city, Colorado",
        "San Diego city, California",
        "Madison city, Wisconsin",
        "Oakland city, California"
    ],
    :total_commuters => [
        4007171,
        1318960,
        347260,
        1938625,
        500469,
        362204,
        419233,
        662522,
        362198,
        231240,
        540911,
        377214,
        715730,
        146715,
        215534,
    ],
    :bike_commuters => [
        48601,
        22449,
        21982,
        20495,
        19429,
        16647,
        14801,
        14397,
        8873,
        8465,
        8266,
        8181,
        7188,
        7186,
        6540,
    ]
)

Let's say that I want to pull the top 10 cities with the most bike commuters. I'd expect that I could use the @slice_max macro and do something simple like this:

@slice_max(df, bike_commuters, n = 10)

However, this returns:

1×3 DataFrame
 Row │ city                     total_commuters  bike_commuters 
     │ String                   Int64            Int64          
─────┼──────────────────────────────────────────────────────────
   1 │ New York city, New York          4007171           48601

Returning a single row in spite of the n value is expected behavior if with_ties = false according to the docs, but the default should be with_ties = true and there are no obvious ties in this data. So I would actually expect the result to be this:

10×3 DataFrame
 Row │ city                               total_commuters  bike_commuters 
     │ String                             Int64            Int64          
─────┼────────────────────────────────────────────────────────────────────
   1 │ New York city, New York                    4007171           48601
   2 │ Chicago city, Illinois                     1318960           22449
   3 │ Portland city, Oregon                       347260           21982
   4 │ Los Angeles city, California               1938625           20495
   5 │ San Francisco city, California              500469           19429
   6 │ Washington city, District of Col…           362204           16647
   7 │ Seattle city, Washington                    419233           14801
   8 │ Philadelphia city, Pennsylvania             662522           14397
   9 │ Boston city, Massachusetts                  362198            8873
  10 │ Minneapolis city, Minnesota                 231240            8465

If I pop over to R with the same dataset, this is the result I get from df |> slice_max(bike_commuters, n = 10), even if I shuffle the dataset around such that the bike_commuters column is not already ordered descending. I would expect @slice_max in Julia to behave the same way in this case.

If I try other values of n with the @slice_max macro, I still get back the 1x3 DataFrame.

drizk1 commented 1 month ago

Thanks for catching this. I haven't looked at the slices in a while.

I think the issue might be due to this block of code, where n is not being captured. Although I'm surprised because the doc strings work.


for expr in exprs
      if @capture(expr, lhs_ = rhs_)
          expr_dict[lhs] = rhs
          if lhs == :missing_rm
              missing_rm = rhs
          elseif lhs == :prop
              arranged = true
          end
      else
          column = expr
      end
  end
  if haskey(expr_dict, :with_ties)
      with_ties = expr_dict[:with_ties]
  end

I'll play around with it and get back to you shortly

mistermichaelll commented 1 month ago

Curious, are the doc strings successfully building the only validation for the output of most macros?

I'm wondering if it's worth adding some additional tests beyond just the pivoting functions, potentially even just replicating some from the dplyr source code to catch some of the less straightforward edgecases/make sure the code is robust.

I'd be happy to help with that if that's something that'd be beneficial!

drizk1 commented 1 month ago

I will defer to @kdpsingh on the additional test as he's the Tidier Maven.

as far as the slice_max bug, I have fixed it. I just need to make the corresponding changes for slice_min

julia> @slice_max(df, bike_commuters, n = 7)
7×3 DataFrame
 Row │ city                               total_commuters  bike_commuters 
     │ String                             Int64            Int64          
─────┼────────────────────────────────────────────────────────────────────
   1 │ New York city, New York                    4007171           48601
   2 │ Chicago city, Illinois                     1318960           48601
   3 │ Portland city, Oregon                       347260           21982
   4 │ Los Angeles city, California               1938625           20495
   5 │ San Francisco city, California              500469           19429
   6 │ Washington city, District of Col…           362204           16647
   7 │ Seattle city, Washington                    419233           14801

julia> df = DataFrame(
                         a = ["a", "b","a", "b", "a","b" ,"a", "b"],
                         b = [0.3, 2, missing, 3, 6, 5, 7, 7],
                         c = [0.2, 0.2, 0.2, missing, 1, missing, 5, 6]);

julia> @chain df begin 
       @group_by a
       @slice_max b n = 3
       end
GroupedDataFrame with 2 groups based on key: a
First Group (3 rows): a = "a"
 Row │ a       b         c        
     │ String  Float64?  Float64? 
─────┼────────────────────────────
   1 │ a            7.0       5.0
   2 │ a            6.0       1.0
   3 │ a            0.3       0.2
â‹®
Last Group (3 rows): a = "b"
 Row │ a       b         c         
     │ String  Float64?  Float64?  
─────┼─────────────────────────────
   1 │ b            7.0        6.0
   2 │ b            5.0  missing   
   3 │ b            3.0  missing   

julia> @chain df begin 
       @group_by a
       @slice_max b
       end
GroupedDataFrame with 2 groups based on key: a
First Group (1 row): a = "a"
 Row │ a       b         c        
     │ String  Float64?  Float64? 
─────┼────────────────────────────
   1 │ a            7.0       5.0
â‹®
Last Group (1 row): a = "b"
 Row │ a       b         c        
     │ String  Float64?  Float64? 
─────┼────────────────────────────
   1 │ b            7.0       6.0
kdpsingh commented 1 month ago

Thanks @mistermichaelll for reporting and @drizk1 for the fix! I will review and merge.

mistermichaelll commented 1 month ago

No problem @kdpsingh !

I was wondering if you saw my suggestion about adding additional unit tests that reflect the ones in the original dplyr package? I think that making the test suite more robust could go a long way towards building trust in the Tidier ecosystem as it starts seeing more adoption.

Like I said, I'm happy to take that on but just wanted to see if that's something you think would be beneficial!

kdpsingh commented 1 month ago

We would love to have additional tests and would really appreciate your help with this.

In general, my preference has been to build all/most tests as doctests (in the docstrings) so that users can also get a clear view into should and shouldn't work.

But I'm open to additional tests in the tests/ folder depending on what you have in mind.

Feel free to open an issue and/or a PR!

kdpsingh commented 1 week ago

Sorry it took so long. This particular issue is now fixed in #110.