JuliaData / DataAPI.jl

A data-focused namespace for packages to share functions
Other
33 stars 13 forks source link

Add `Selector` abstract type for ecosytem compat, and rethink `Between` #41

Open rafaqz opened 3 years ago

rafaqz commented 3 years ago

DimensionalData.jl defines a Selector abstract type and selectors Between, Near, At, Where, and Contains.

AxisKeys.jl also defines Selector and Near and Inervals selectors (ping @mcabbott)

It would be useful to standardise these so we can all use the simplest words without clashes, and this is probably the place for that.

We could define Selector here, and make Between <: Selector. Near could also move here to avoid the (maybe less common) clashes between AxisKeys.jl and DimensionalData.jl, if @mcabbott is also interested in sharing these types.

Between in DD holds a 2-tuple of any types, while here it has two fields limited to Union{Int,Symbol}.

bkamins commented 3 years ago

I agree we can make this change. This would be breaking for DataAPI.jl, so 2.0 release would be needed, which is a bit painful given then number of dependants https://juliahub.com/ui/Packages/DataAPI/3a8mN/1.8.0?t=2. However, I assume that only DataFrames.jl uses Between now, so it would be just a matter of changing 1 to 1, 2 in Project.toml files.

In DataFrames.jl we could allow both 1 and 2 release of DataAPI.jl and appropriately handle both cases.

AN ALTERNATIVE that would be non-breaking would be to define Between to hold two fields of any types (so then DD and AxisKeys.jl would need to accommodate). This would be less breaking (so I feel it is preferred), but maybe there are some benefits of holding a 2-tuple so let us discuss.

CC @nalimilan

rafaqz commented 3 years ago

If it's less breaking DimensionalData.jl can change to 2 fields, there is a breaking version bump coming up soon anyway, and it wont affect behavior as the constructor already accepts either Tuple or 2 arguments.

The reason for using 1 field was so that all Selectors have a val field holding the thing to be selected, which is a single value or a Tuple. Just to standardise that aspect. The val method also returns that value. But otherwise it's not important.

rafaqz commented 3 years ago

DD also defines rebuild methods for these to update the contents programatically, with ConstructionBase.setproperties as the default keyword constructor. Keeping field names the same removes all boilerplate. You can rebuild any selector with a modified value using the same syntax.

rebuild(x; kw...) = ConstructionBase.setproperties(x, (; kw...))

So that:

rebuild(sel; val=2 .* val(sel))

Or similar will just work, even when a Selector has additional metadata fields, like atol for At. Otherwise updating Selctor values progoramatically has to be special-cased for each selector type.

bkamins commented 3 years ago

DD also defines rebuild methods for these to update the contents programatically

This is what I have feared that there might be such cases. Let us wait for @nalimilan to comment in the first place what he thinks about merging the definitions and then we will decide what path should be taken to merge them (as maybe we can agree that DataAPI.jl definition is "internal" and changing it is non-breaking as long as we keep the external API untouched).

nalimilan commented 3 years ago

Merging definitions sounds like a good idea, but tagging DataAPI 2.0 now is a no-go IMO. If we really want to switch to a single field, we could define appropriate constructors and getproperty/setproperty! methods to continue supporting the previous syntax.

That said, I don't really get the value of being able to do rebuild(sel; val=2 .* val(sel)) on any type of selector. Does happen that you don't know what kind of selector you have and just want to multiply the values by 2? As a counter-argument, it seems that rebuild(sel; last=2) could be more often useful if you want to keep the starting point but change the ending point.

mcabbott commented 3 years ago

AxisKeys.jl takes Interval is from IntervalSets.jl, and Not from InvertedIndices.jl. I doubt these two will want extra supertypes (although IntervalSets.jl depends on EllipsisNotation.jl which I see now depends on ArrayInterface.jl so it's not 10 lines anymore.)

Its struct Near{T} <: Selector{T} val::T end holds anything, and has the meaning of taking the nearest one element. I think that's the same meaning as DD.jl. The alternative meaning would be a value with a tolerance.

bkamins commented 3 years ago

One small consideration to also mention is https://github.com/JuliaData/DataAPI.jl/blob/92b0def99ec396a50ab22d594ed9c37920537156/src/DataAPI.jl#L196. This is intended to allow for lazy broadcasting of the selector (i.e. that it can be properly expanded by a caller side that is aware of the actual range of indices selected). Would this break something in the other packages?

rafaqz commented 3 years ago

@nalimilan Multiplying values by 2 was a bad example... practically it's more like swapping out their contents, e.g. a Vector, for the individual values of the vector. But 2 fields is fine.

But maybe DD can also move to using Interval from IntervalSets.jl like AxisKeys, and this wont be an issue. I used Between because the pattern is selectors as positional words. And interval is confusing when an index can represent intervals or points, both of with can be selected with Between.

@mcabbott yes Near means the same thing in both AK and DD.