JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

Support type-based column selectors #3034

Open wolthom opened 2 years ago

wolthom commented 2 years ago

Currently, when applying a transformation to all columns of a specific type (or subtypes of an abstract type), a pattern such as transform(df, names(df, Number) .=> f) is used. Ideally, this could be achieved with a column-selector, e.g. transform(df, Cols(Number) .=> f).

Although this is a minor convenience feature, it could make the column-selector API (even) more consistent, and users would not have to repeat the name of the data frame multiple times.
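For reference, here is a minimal sketch of the pattern that works today. The example data frame and the doubling function are assumptions made up for illustration; `names(df, Number)` and the broadcasted `.=>` are the pieces described in the comment above.

```julia
using DataFrames

# Illustrative data: two numeric columns and one string column
df = DataFrame(a = [1, 2, 3], b = [1.5, 2.5, 3.5], c = ["x", "y", "z"])

# names(df, Number) returns the names of columns whose eltype is a subtype of Number
numeric_cols = names(df, Number)

# Broadcasting .=> builds one `column => function` pair per selected column;
# renamecols = false keeps the original column names
doubled = transform(df, numeric_cols .=> (col -> 2 .* col); renamecols = false)
```

Note that `df` has to be named twice here, once as the argument of `transform` and once inside `names`, which is the repetition the proposal wants to avoid.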

bkamins commented 2 years ago

Yes - I just need to think about whether there are any corner cases that would lead to problems. We could even potentially allow df[:, Number] if it does not lead to problems.

bkamins commented 2 years ago

OK - now I remember why we do not have this.

Except for the names function, all other column selectors currently get resolved in the context of AbstractIndex, not AbstractDataFrame (i.e. they only have access to column names, but do not have access to column contents).

So adding the requested functionality would require a significant redesign. This is of course doable.

@nalimilan - what do you think?

nalimilan commented 2 years ago

I agree it would be nice to be able to do transform(df, Cols(Number) .=> f) at least. But yeah the implementation may not be trivial. (This was discussed briefly at https://github.com/JuliaData/DataFrames.jl/pull/2400.)

bkamins commented 2 years ago

@nalimilan - I can do it. The only issue is that the PR might end up being 1000 lines and touch many files so it will be hard to review (not sure yet - maybe it will be easier). Essentially we need to drop using AbstractIndex almost everywhere and instead pass around AbstractDataFrame. This is challenging because we need to correctly handle all types that DataFrames.jl defines (as deep down they all use AbstractIndex somewhere).

In other words, the original design of DataFrames.jl assumes such functionality will not be needed (AbstractIndex is not aware of column element types), so we need to change a fundamental element of the design here.
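The limitation can be illustrated with a toy model. This is only a sketch: `ToyIndex` and the three `resolve_*` helpers are hypothetical names invented here and are not the actual DataFrames.jl implementation. The point is that an object storing only column names can answer name-based queries, while a type-based query needs the columns themselves.

```julia
# Hypothetical, simplified stand-in for an index: it stores only column names
struct ToyIndex
    lookup::Dict{Symbol, Int}   # column name => column position
    colnames::Vector{Symbol}
end

ToyIndex(colnames::Vector{Symbol}) =
    ToyIndex(Dict(n => i for (i, n) in enumerate(colnames)), colnames)

# Name-based selectors can be resolved from the index alone
resolve_name(idx::ToyIndex, name::Symbol) = idx.lookup[name]
resolve_matching(idx::ToyIndex, pred) =
    [i for (i, n) in enumerate(idx.colnames) if pred(String(n))]

# A type-based selector also needs the column data, which the index does not hold
resolve_eltype(columns, T::Type) =
    [i for i in eachindex(columns) if eltype(columns[i]) <: T]

idx = ToyIndex([:a, :ab, :c])
columns = AbstractVector[[1, 2], [1.5, 2.5], ["x", "y"]]

by_name  = resolve_name(idx, :c)                            # uses idx only
by_match = resolve_matching(idx, n -> startswith(n, "a"))   # uses idx only
by_type  = resolve_eltype(columns, Number)                  # cannot be answered from idx
```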

bkamins commented 2 years ago

@nalimilan - let us make a decision if we:

  1. add it in 1.4 release.
  2. postpone to later releases for a decision.
  3. keep the things as they are (i.e. require names(df, "type") syntax).

I would like to finalize the scope of 1.4 release so that we can have it before JuliaCon.

bkamins commented 2 years ago

I move it to 1.5 release for a decision

bkamins commented 1 year ago

I was thinking about it. The issue is that AbstractIndex was designed as: https://github.com/JuliaData/DataFrames.jl/blob/b240458aca1681e74a94e979a0141b2b16f1a3e0/src/other/index.jl#L1

so it - by design - only supports name lookup.

Now the issue is that to create a DataFrame, we have to construct its index first. So we cannot even naturally have a back-reference to a data frame in the index.

In summary, this means that it would be a major redesign of DataFrame, SubDataFrame, DataFrameRow, Index, and SubIndex if we wanted to allow for such a change. One particular consequence is that the 1.5 release would be incompatible with the 1.4 release at the binary level (and people often serialize/jld data frames).

@nalimilan - the question is if we want to do it.

An alternative would be to special-case such a selector before passing it to the index, but this would lead to an ugly design (in many places we would have to apply a patch that is hard to maintain).

bkamins commented 1 year ago

After more thinking I am giving it a 1.x milestone. Maybe we will add it at some point, but it is not likely we will do it fast. For now users need to use names or work with eachcol to filter on element type of a column.
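The two workarounds mentioned here might look as follows. This is a sketch on made-up data; the column names and values are assumptions for illustration only.

```julia
using DataFrames

# Illustrative data: column b contains a missing value
df = DataFrame(a = [1, 2, 3], b = [1.5, missing, 3.5], c = ["x", "y", "z"])

# Workaround 1: names with a type selects columns whose eltype is a subtype of it.
# Column b has eltype Union{Missing, Float64}, so it is not matched by Number alone.
strict  = names(df, Number)
lenient = names(df, Union{Missing, Number})

# Workaround 2: build a Bool mask from eachcol and pass it as a column selector
mask = [eltype(col) <: Union{Missing, Number} for col in eachcol(df)]
selected = select(df, mask)
```

The `eachcol` variant is more verbose, but the comprehension can test anything about a column, including its values, not only its element type.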

bkamins commented 1 year ago

In this issue let us track all requests for basing column selection on column values (as column element type is just a special case).

In this post I discuss the choice in more detail.

If you feel we should add this functionality please vote up: 👍. If you feel it is OK not to have a special syntax for it please vote down: 👎.

Thank you!

alfaromartino commented 1 year ago

My two cents about why I wouldn't recommend adding a new method:

  1. The operation can be implemented in other ways already. The more methods to implement the same feature, the harder to read code written by third parties. This aspect affects new users, who would become really confused about what methods to learn when they're learning the language.

  2. Somewhat related to 1, adding new syntax for DataFrames.jl forces new users to learn syntax specific to DataFrames.jl (even if it's just to read other people's code). This is problematic if they're learning the Julia language in general.

  3. From what's described, the implementation doesn't seem easy and there are some issues involved. Given that it's not trivial, I think implementing other features would be more beneficial. For example, any performance improvement seems more beneficial than implementing one more method for the same operation (e.g., I read somewhere about an improvement of groupby operations when there are a lot of small groups).

kdpsingh commented 1 year ago

I appreciate the thoughtful examples in the blog post! With the examples you’ve given there, I think I should be able to wrap this functionality within TidierData.jl. The only piece I’m concerned about is making sure I escape the data frame in the right place since I have a bunch of functions that parse and modify the expression along the way. Will let you know if I run into roadblocks.

tp2750 commented 1 year ago

Looks like an interesting feature. I like it being an explicit functionality, as it makes it easier to find in the documentation. I was not able to find examples of value-based column selection in the DataFrames.jl documentation.

If there is no performance benefit of

select(df, Cols(startswith("a")) .& Vals(x -> any(ismissing(x))))

over

select(df, [startswith(string(n), "a") && any(ismissing, c)
                   for (n,c) in pairs(eachcol(df))])

perhaps it might as well be done by a macro in DataFramesMeta?

math4mad commented 10 months ago

When working with PCA or cor(Matrix), it is better to keep only columns of a Number type. How do I specify the supertype?

using Pipe, Tidier

df = load_csv("airbnb_nyc_2019", false)
type_df = @pipe describe(df) |> select(_, [:variable, :eltype])
int_df = @chain type_df begin
    @filter(isa(eltype, Union{Type{Int64}, Type{Float64}}))
end

Is there a better way to define this type than @filter(isa(eltype,Union{Type{Int64},Type{Float64}}))?

kdpsingh commented 10 months ago

Hi @math4mad,

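Thanks for the question. Just to clarify, are you asking:

  • In general, how to identify super types?
  • Or how to get this code to work in TidierData.jl?
  • Or how to only select columns containing integers/floats in either TidierData.jl or DataFrames.jl?

Or all of the above?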

That may help with tailoring the reply a bit better. Thanks!

math4mad commented 10 months ago

> Hi @math4mad,
>
> Thanks for the question. Just to clarify, are you asking:
>
>   • In general, how to identify super types?
>   • Or how to get this code to work in TidierData.jl?
>   • Or how to only select columns containing integers/floats in either TidierData.jl or DataFrames.jl?
>
> Or all of the above?
>
> That may help with tailoring the reply a bit better. Thanks!

just select columns containing Numerical super-type

bkamins commented 10 months ago

> just select columns containing Numerical super-type

Do you mean to select all columns (denoted col below) for which:

  1. eltype(col) <: Number
  2. all(x -> x isa Number, col)
  3. eltype(col) <: Union{Missing, Number}
  4. all(x -> x isa Union{Missing, Number}, col)

(I am listing the four most common cases you might want to select.)
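To make the difference between the four cases concrete, here is a sketch on a small made-up data frame. The column names and contents are assumptions chosen so that each case selects a different set of columns.

```julia
using DataFrames

df = DataFrame(
    ints    = [1, 2, 3],             # eltype Int64
    floats  = [1.0, missing, 3.0],   # eltype Union{Missing, Float64}
    anynum  = Any[1, 2.5, 3],        # eltype Any, but every value is a number
    anymiss = Any[1, missing, 3],    # eltype Any, numbers and missing
    text    = ["x", "y", "z"],
)

# Case 1: eltype(col) <: Number
case1 = names(df, Number)

# Case 2: all(x -> x isa Number, col), which inspects values rather than eltype
case2 = names(df)[[all(x -> x isa Number, col) for col in eachcol(df)]]

# Case 3: eltype(col) <: Union{Missing, Number}
case3 = names(df, Union{Missing, Number})

# Case 4: all(x -> x isa Union{Missing, Number}, col)
case4 = names(df)[[all(x -> x isa Union{Missing, Number}, col) for col in eachcol(df)]]
```

Cases 1 and 3 only look at the declared element type, so they are cheap but miss the `Any` columns. Cases 2 and 4 scan every value, so they also pick up loosely typed columns.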

math4mad commented 10 months ago

> > just select columns containing Numerical super-type
>
> Do you mean to select all columns (denoted col below) for which:
>
>   1. eltype(col) <: Number
>   2. all(x -> x isa Number, col)
>   3. eltype(col) <: Union{Missing, Number}
>   4. all(x -> x isa Union{Missing, Number}, col)
>
> (I am listing the four most common cases you might want to select.)

For now, I think it would be option 2.