JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Add operator kwarg to Cols #58

Closed bkamins closed 1 year ago

bkamins commented 1 year ago

x-ref https://github.com/JuliaData/DataFrames.jl/pull/3224

codecov[bot] commented 1 year ago

Codecov Report

Base: 96.36% // Head: 96.36% // No change to project coverage :thumbsup:

Coverage data is based on head (6fc6ae4) compared to base (69313ee). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #58 +/- ## ======================================= Coverage 96.36% 96.36% ======================================= Files 1 1 Lines 55 55 ======================================= Hits 53 53 Misses 2 2 ``` | [Impacted Files](https://codecov.io/gh/JuliaData/DataAPI.jl/pull/58?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData) | Coverage Δ | | |---|---|---| | [src/DataAPI.jl](https://codecov.io/gh/JuliaData/DataAPI.jl/pull/58/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData#diff-c3JjL0RhdGFBUEkuamw=) | `96.36% <100.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JuliaData)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

bkamins commented 1 year ago

CC @jar @pdeffebach @yjunechoe

adienes commented 1 year ago

I think we discussed this in the previous issue but I forgot---could you remind me why we use keywords here rather than a function? Then it would have Cols(...; operation=any) and Cols(...; operation=all) using built in functions any and all, and could use user-defined predicates.

bkamins commented 1 year ago

We have not discussed it previously. Actually what you propose is a good idea. But any and all are not correct functions to be passed. We can say that union function is the default and intersect is a function to pass if one wants intersection. However, potentially user could pass any function operating on sets and returning a set as a result.

Then the definition in DataFrames.jl would be:

@inline Base.getindex(x::AbstractIndex, idx::Cols) =
    isempty(idx.cols) ? Int[] : idx.operation(getindex.(Ref(x), idx.cols)...)

It would be more error prone than the current implementation (potentially leading to crazy error stack traces) but more flexible indeed. @nalimilan - what do you think?

adienes commented 1 year ago

Makes sense. Then we'd also get setdiff and symdiff for free

jariji commented 1 year ago

Yep, I agree with adienes.

nalimilan commented 1 year ago

Yeah why not. Maybe we need a more specific name than "operation" then for the argument. Again the same discussion as the one about "combine" in unstack? :-)

bkamins commented 1 year ago

AFAICT this fits well. In mathematics union and intersect etc. are typically called "operations", c.f. e.g. https://en.wikipedia.org/wiki/Set_(mathematics)#Basic_operations

bkamins commented 1 year ago

I have updated the PR.

adienes commented 1 year ago

Perhaps operator is very slightly a better choice than operation for some very tenuous reasons:

  1. often the arguments in Cols will be indicator/characteristic functions of sets, and a function that acts on other functions is often called an operator
  2. to me it scans slightly more clearly that what follows operator= should be a verb (i.e. a function)
  3. it is 1 character shorter :)

The wikipedia page for the Algebra of Sets uses both terms interchangeably, so I'm not sure which is more standard.

bkamins commented 1 year ago

I checked several references, and the border between operation and operator is thin. Operation (mathematics) gives a most clear distinction:

An operator is similar to an operation in that it refers to the symbol or the process used to denote the operation, hence their point of view is different. For instance, one often speaks of "the operation of addition" or "the addition operation", when focusing on the operands and result, but one switches to "addition operator" (rarely "operator of addition"), when focusing on the process

Given this explanation indeed operator seems better, but I am not sure.

adienes commented 1 year ago

especially when the symbols and are used instead of intersect and union then operator definitely seems like the better word there

bkamins commented 1 year ago

changed

bkamins commented 1 year ago

Thank you!