JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/
Other
1.74k stars 367 forks source link

[Suggestion] A (public) function that takes the same args as `subset` and returns the matched indices #3239

Open rben01 opened 1 year ago

rben01 commented 1 year ago

It would be great if DataFrames.jl had a function or functions that would function more or less the same way subset does, except that they'd would return a vector containing the indices of kept rows instead of a new frame. This vector would be suitable for subsequent row indexing. (Thankfully this function already more or less exists already.) For example, you'd have something like this:

julia> df = allcombinations(DataFrame, Symbol("col 1")=>1:5, Symbol("col 2")=>1:5); df[!, "col 3"] = missings(String, nrow(df)); df
25×3 DataFrame
 Row │ col 1  col 2  col 3   
     │ Int64  Int64  String? 
─────┼───────────────────────
   1 │     1      1  missing 
   2 │     2      1  missing 
   3 │     3      1  missing 
  ⋮  │   ⋮      ⋮       ⋮
  23 │     3      5  missing 
  24 │     4      5  missing 
  25 │     5      5  missing 
              19 rows omitted

julia> #= current way (BitVector) =# (df[!, "col 1"] .% 2 .== 0) .&& (df[!, "col 2"] .% 2 .== 1)
25-element BitVector:
 0
 1
 0
 1
 0
 ⋮
 0
 1
 0
 1
 0

julia> #= current way (indices) =# findall((df[!, "col 1"] .% 2 .== 0) .&& (df[!, "col 2"] .% 2 .== 1))
6-element Vector{Int64}:
  2
  4
 12
 14
 22
 24

julia> subset_conditions(df, selectors...; skipmissing::Bool=false, threads::Bool=true) =
           DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads);

julia> #= proposed way =# subset_conditions(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1)
25-element BitVector:
 0
 1
 0
 1
 0
 ⋮
 0
 1
 0
 1
 0

julia> subset_indices(df, selectors...; skipmissing::Bool=false, threads::Bool=true) = 
           findall(DataFrames._get_subset_conditions(df, Ref{Any}(selectors), skipmissing, threads));

julia> #= proposed way =# subset_indices(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1)
6-element Vector{Int64}:
  2
  4
 12
 14
 22
 24

Since these are suitable for indexing, you can do something like this:

julia> df[subset_indices(df, "col 1" => c -> c .% 2 .== 0, "col 2" => c -> c .% 2 .== 1), "col 3"] .= "even,odd"; df
25×3 DataFrame
 Row │ col 1  col 2  col 3    
     │ Int64  Int64  String?  
─────┼────────────────────────
   1 │     1      1  missing  
   2 │     2      1  even,odd
   3 │     3      1  missing  
  ⋮  │   ⋮      ⋮       ⋮
  23 │     3      5  missing  
  24 │     4      5  even,odd
  25 │     5      5  missing  
               19 rows omitted

For a simple example like this not much is gained, but for more complicated functions I think it begins to be worth it — especially if you have ByRow transformations that are tricky to express with broadcasting.

rben01 commented 1 year ago

Alternatively, a general-purpose way of taking “cross sections” of DataFrames that can handle both rows and columns. Maybe the same way that there's a ColumnIndex type that is used for column indexing, a RowIndex type could be created to wrap these selectors before passing them to DataFrame indexing functions. Something like (say) df[RowIndex(row_selectors), col_selectors].

I'd love to be able to say e.g.,

df[RowIndex(:A => ByRow(passmissing(func)), :B => c -> c .% 2 .== 0; skipmissing=true), :C] .= "whatever"
bkamins commented 1 year ago

Your request essentially asks for allowing more complex row selection rules in indexing.

I would like to start with the discussion why it is needed. What I mean is I want to understand why using subset or filter is not enough for you? Do I understand it correctly that you want to avoid having to call e.g. subset + select combination and instead be able to call e.g. getindex or view?

So, you want df[row_selector, col_selector] instead of select(subset(df, row_selector), col_selector). Is my understanding correct?

If yes - then could you comment in what cases it is most useful? Thank you!

rben01 commented 1 year ago

@bkamins Yes, you're correct. The reason I'd like to have the df[row_selector, col_selector] syntax is that select(subset(df, row_selector), col_selector) is more verbose, and if you want a view into the df then it gets even more verbose: select(subset(df, row_selector; view=true), col_selector; copycols=false). And forgetting either kwargs will lead to a hard-to-spot bug. On the other hand, df[row_selector, col_selector] is concise and clearly returns a view into the data.