JuliaArrays / AxisArrays.jl

Performant arrays where each dimension can have a named axis with values
http://JuliaArrays.github.io/AxisArrays.jl/latest/
Other
200 stars 42 forks source link

Move more logic into Axis type? #15

Open mbauman opened 9 years ago

mbauman commented 9 years ago

Now that the axis information is stored as proper Axis types, one possible way to both allow an unchecked constructor and fix #14 would be to move the sort check into the Axis constructor. We could even go further, and make Axis an abstract super type of all these traits, with the Axis constructor returning the appropriate type. This could include, for example, a SortedDimensionalAxis type.

This would, of course, make the abstract Axis constructor type-unstable. It also moves more of the axis trait complexity into the foreground.

mbauman commented 9 years ago

Another way we could simplify things by making Axis objects more capable would be to make them an <: AbstractVector over their contents. The downside here is that Axis is also used for indexing identification, which don't need to be vectors. Scalars behave differently from Vectors and Intervals can't even support getindex. Perhaps this means we're punning too much on Axis.

mbauman commented 8 years ago

I still like the idea of making the Axis type do the validation of the axis vectors — like ensuring that dimensional axes are sorted. But everything beyond the first sentence of this post reads like gibberish to me now.

timholy commented 8 years ago

Yes, checking upon construction seems like a great idea. Supporting unsorted axes seems like a nightmare. issorted is pretty cheap (especially for Ranges :wink:).

But that raises a question: do we allow any monotonic sequence, or only increasing? I'm fine with purely increasing.

mbauman commented 8 years ago

do we allow any monotonic sequence, or only increasing? I'm fine with purely increasing.

I had initially restricted this to purely increasing, but we decided relatively early on that monotonicity was sufficient. Part of what enables this is that we error upon scalar indexing by a duplicated axis value. currently don't allow indexing by scalar axis value for dimensional axes, so there's never a question of what to do for repeated axis values. Even if we decide to eventually allow scalar indexing, we could simply define it to return the first value, which seems quite sensible to me.

timholy commented 8 years ago

Thanks for the clarification, I hadn't thought of the repeated value issue. Can you point me to some background? Esp. that might clarify what "use" repeated indices have?

I should add a clarification of my own: my real question was regarding monotonically decreasing sequences. I don't see the need to support them personally, but the basic concept of fast searching applies to any monotonic sequence, so I thought I'd ask.

mbauman commented 8 years ago

I'm not sure how much of my thinking here made it onto public venues… or still remains in my head for that matter (see my edit above). Here's a bit of a brain dump… which should probably be split across multiple issues.

This comes down to how I've categorized axis types into Dimensional or Categorical. It determines two things: which indexing-by-axis-value methods are supported and how lookup is performed.

Dimensional (sorted) Categorical (unique elements)
Scalar indexing only with !(T <: Real), error on dupes yes, no dupes by construction
Interval indexing yes not implemented
Vector indexing not implemented yes
Lookup methods searchsorted, searchsortedfirst findfirst, findin (should move to hashing)

So: I think a decreasing sequence could work just as well as an increasing one (assuming it implements the search interface properly), but I've not given that case much thought. At a minimum, It'll most likely break my searchsortednearest extension to the search API.


The restriction on Scalar indexing in Dimensional {T<:Real} AxisArrays is just to prevent ambiguity between location-based and axis-value-based indexing. It'd be interesting to experiment with making UnitRange-axis AxisArrays behave like an OffsetArray, but I think it'd wreck havoc with float axes and non-UnitRange integer axes. I'm concerned that the distinction between indexing a float axis by value (2.0) or integer location (1) is too subtle for general use.

It's worth noting that Panda's indexing scheme currently allows this in some regards: if an "axis" is float or integer, then by-axis-value lookup takes priority over location-based indexing in their default x[] API… but I've always found it to be surprising, and they're thinking about changing it. They're able to get away with this to some extent because of the way python objects expose the __array__ API separately from the x[] interface, and they additionally have dedicated x.loc[] and x.iloc[] methods for by-axis-value-lookup and by-location-indexing, respectively. All in all, it's fairly confusing… especially once you start considering slicing (which sometimes acts like closed intervals across the label locations).

A better example is xarray, where they have dramatically simplified this scheme: x[] is always location, x.loc[] is always axis-based. They've additionally added x.sel() and x.isel() APIs to enable specifying the axis name as a keyword argument, which is very clever, but has an unfortunate interaction with NumPy's heisenviews and assignment.


As far as repeated axis values go, this was implemented in #16 for a use-case in Sims.jl. It's not hard to check for multiple values upon scalar lookup and error. And to be honest, I was surprised at how little impact this had on the code.

Pandas allows duplicates but is "type-unstable" here, returning a scalar or vector if there are dupes. Interestingly, xarray also allow dupes and always returns xarrays from indexing… even scalar indexing by integer locations.