Open rben01 opened 1 year ago
Alternatively, a general-purpose way of taking “cross sections” of DataFrames that can handle both rows and columns. Maybe, in the same way that there's a ColumnIndex type that is used for column indexing, a RowIndex type could be created to wrap these selectors before passing them to DataFrame indexing functions. Something like (say) df[RowIndex(row_selectors), col_selectors].
I'd love to be able to say e.g.,
df[RowIndex(:A => ByRow(passmissing(func)), :B => c -> c .% 2 .== 0; skipmissing=true), :C] .= "whatever"
Your request essentially asks for allowing more complex row selection rules in indexing.
I would like to start by discussing why it is needed. What I mean is that I want to understand why using subset or filter is not enough for you. Do I understand correctly that you want to avoid having to call e.g. a subset + select combination and instead be able to call e.g. getindex or view?
So, you want df[row_selector, col_selector] instead of select(subset(df, row_selector), col_selector). Is my understanding correct?
If yes, could you comment on the cases in which it is most useful? Thank you!
@bkamins Yes, you're correct. The reason I'd like to have the df[row_selector, col_selector] syntax is that select(subset(df, row_selector), col_selector) is more verbose, and if you want a view into the df then it gets even more verbose: select(subset(df, row_selector; view=true), col_selector; copycols=false). And forgetting either kwarg will lead to a hard-to-spot bug. On the other hand, df[row_selector, col_selector] is concise and clearly returns a view into the data.
It would be great if DataFrames.jl had a function (or functions) that worked more or less the same way subset does, except that it would return a vector containing the indices of the kept rows instead of a new data frame. This vector would be suitable for subsequent row indexing. (Thankfully, this function more or less exists already.) For example, you'd have something like this:
Since these are suitable for indexing, you can do something like this:
For a simple example like this not much is gained, but for more complicated functions I think it begins to be worth it, especially if you have ByRow transformations that are tricky to express with broadcasting.
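To make the row-index idea concrete, here is a minimal sketch. The helper name subset_indices is hypothetical and is not part of DataFrames.jl; it assumes that calling parentindices on the view returned by subset(...; view=true) gives back the kept row indices of the parent frame. The sample data is made up for illustration.

```julia
using DataFrames

# Hypothetical helper (NOT a DataFrames.jl function): accept the same selectors
# as `subset`, but return the indices of the kept rows rather than a new frame.
# Assumption: `parentindices` on the SubDataFrame yields (rows, cols) of the parent.
function subset_indices(df::AbstractDataFrame, args...; skipmissing::Bool=false)
    sdf = subset(df, args...; skipmissing=skipmissing, view=true)
    return parentindices(sdf)[1]
end

# Made-up example data
df = DataFrame(A=[1, 2, missing, 4], B=[2, 3, 4, 6], C=["a", "b", "c", "d"])

# Keep rows where A > 1 (missing treated as false) and B is even
rows = subset_indices(df,
                      :A => ByRow(x -> x > 1),
                      :B => c -> c .% 2 .== 0;
                      skipmissing=true)

# The index vector can be used directly for in-place assignment
df[rows, :C] .= "whatever"
```

Because the helper only produces row indices, the same vector could also be passed to view(df, rows, cols) when a view rather than an assignment is wanted.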