JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Details of the functionality of Cols #19

Closed bkamins closed 4 years ago

bkamins commented 4 years ago

We had a discussion with @nalimilan how exactly Cols should work.

The question is if:

Cols(:x1, :x1)

should be an allowed selection rule.

What I mean is that clearly things like:

Cols(r"x", Between(:x, :y))

should be allowed (taking a union), but should we allow taking a union of single column selectors.

As a corner case:

Cols(:x1, [:x1])

would be allowed (as [:x1] selects multiple columns - just in this case it is just a single column).

The rationale is that it is unlikely that:

Cols(:x1, :x1)

is an intended thing (probably this is an error on the user side).

@piever - do you have any opinion on this?

bkamins commented 4 years ago

CC @pdeffebach, @CameronBieganek

pdeffebach commented 4 years ago

To clarify the issue is as follows, right?

function foo(x, y)
    x .+ y .- mean(x)
end

My intuition is to throw an error. Given that Cols promises no guarantee of ordering in the vast majority of scenarios, the user shouldn't rely on a 1:1 behavior of Cols in this context. They can do

select(df, [:x1, :x1] => foo)

If they want, right?

nalimilan commented 4 years ago

@pdeffebach No, the idea is that if it didn't error, Cols(:x1, :x1) would be equivalent to [:x1] (contrary to [:x1, :x1]).

As usual, we can throw an error for now and relax the rules later if needed.

pdeffebach commented 4 years ago

Sorry, I meant the error at the call of foo, which requires two arguments.

The behavior you described is the correct behavior. I believe Cols(:x1, :x1) should be equivalent to [:x1].

bkamins commented 4 years ago

I believe Cols(:x1, :x1) should be equivalent to [:x1].

What is the benefit of allowing this over throwing an error?

nalimilan commented 4 years ago

I guess it could make sense if these are stored in variables which can take multiple values. Though half of the time it would probably indicate a bug and you can always call unique first.

pdeffebach commented 4 years ago

What is the benefit of allowing this over throwing an error?

I think it would be frustrating for users if

Cols(x #= Container which contains :x1 =#, y #= Container which contains :x1=#) # Between(), etc.

worked but not

Cols(:x1, :x1)
bkamins commented 4 years ago

half of the time it would probably indicate a bug

This is my understanding of the situation, but given there is no support for it it is even easier to have Cols behave by always taking a union.

CameronBieganek commented 4 years ago

Cols(:x1, :x1) == Cols(:x1) makes sense to me. That's consistent with the

Cols(A, B, C) ==> (A, B\A, C\(A ∪ B))

behavior I mentioned in the other thread.

bkamins commented 4 years ago

It is consistent when A, B, C are sets. I considered :x1 to be a separate case as it is not a set but a set element.

CameronBieganek commented 4 years ago

I'm willing to consider :x1 to be a set with one element. I'd hate to have to write [:x1] all the time. 🙂

piever commented 4 years ago

I don't have a strong opinion here, but I would say it's fine to allow it. If one of them is not a symbol literal, this could be useful (say Cols(:x1, col) and col is set programmatically).

The only analogous thing I can think of is for example repeated named arguments in julia Base, and there they are only forbidden if they are passed "not programmatically":

julia> (; a = 1, a = 2)
ERROR: syntax: field name "a" repeated in named tuple
Stacktrace:
 [1] top-level scope at REPL[1]:1
 [2] eval(::Module, ::Any) at ./boot.jl:331
 [3] eval_user_input(::Any, ::REPL.REPLBackend) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/REPL/src/REPL.jl:86
 [4] run_backend(::REPL.REPLBackend) at /home/pietro/.julia/packages/Revise/WkyNB/src/Revise.jl:1023
 [5] top-level scope at none:0

julia> (; a = 1, :a => 2)
(a = 2,)

As we do not have a good way to test if the symbol was a literal or came from some variable, the simplest is probably to just allow it.