JuliaArrays / AxisArrays.jl

Performant arrays where each dimension can have a named axis with values
http://JuliaArrays.github.io/AxisArrays.jl/latest/
Other
200 stars 42 forks source link

Deprecate cat? #73

Closed GordStephen closed 7 years ago

GordStephen commented 7 years ago

As discussed in #25 and illustrated in #64, it's not clear whether concatenating arbitrary AxisArrays together makes sense. As far as I can tell merge and join provide the functionality that you would want from cat, provided your axes have meaningful names and values (and if they don't, why are you using an AxisArray?).

I'd be interested to know if:

  1. anyone is actually using cat
  2. anyone is actually using it for something merge/join can't do.

For some historical context, I implemented cat first and realised soon after that I needed something a bit smarter - that's when I added merge/join. Had I had a better understanding of my needs I likely wouldn't have added it to begin with. Now it just seems like a source of potential confusion and frustration: the same axis constraints that govern the other functions apply to cat as well, it just isn't smart enough to do the right thing when those constraints would be violated, and throws an error instead.

If cat is indeed redundant, it might make sense to either:

One issue with the second option is that cat implies a lower-level control over dimensional ordering that merge and join abstract away, so users may expect it to do things it won't. Of course, this is true of the current cat method as well.

timholy commented 7 years ago

I haven't followed the combining stuff at all (and I'm not using cat), but it occurs to me that both might be handled by cat(newaxis, A, B, ...)?

One problem with the current scheme is that keyword arguments are not inferrable, and in general I don't think either merge or join are inferrable (try inserting @inferred in various places in test/combine.jl).

Balinus commented 7 years ago

The problem I see with merge right now is the time it takes to merge 2 AxisArray. Right now, I'm trying to merge 2 AxisArray of size 366x1068x510. after 12 minutes, it is still not finished and it takes about 26 GB or RAM (and still rising). I killed the process before it went higher (shared server!).

If I simply use vcat, it's mostly instantaneous.

I will provide some more info when I get the time.

edit - To answer the question in 1st post: Yes, I'm using vcat in my package as a way to merge time-dependent AxisArrays (not using TimeAxisArrays yet).

GordStephen commented 7 years ago

That's a good datapoint, thanks!

Balinus commented 7 years ago

Here's a simple exercise that takes way too much RAM and time for merge to be useful (at least for Climate data, where this size of arrays are everyday work).

Again, I killed the process before it ended to avoid sys admin knocking at my door!

data = randn(400, 1000, 500);

julia> dataOut = AxisArray(data, Axis{:time}(1:400), Axis{:lon}(1:1000), Axis{:lat}(1:500));

julia> dataOut2 = AxisArray(data, Axis{:time}(401:800), Axis{:lon}(1:1000), Axis{:lat}(1:500));

julia> test = merge(dataOut, dataOut2)
Balinus commented 7 years ago

With a simple array of size 400x10x5, the merge method is about 380 times slower on our server. I can't see how merge can replace cat with such a penalty.

After warmpup

julia> using AxisArrays
julia> data = randn(400, 10, 5);
julia> A1 = AxisArray(data, Axis{:time}(1:400), Axis{:lon}(1:10), Axis{:lat}(1:5));
julia> A2 = AxisArray(data, Axis{:time}(401:800), Axis{:lon}(1:10), Axis{:lat}(1:5));

@time vcat(A1, A2)
  0.000499 seconds (133 allocations: 324.578 KB)
@time cat(1, A1, A2)
  0.000551 seconds (133 allocations: 324.578 KB)
@time merge(A1, A2)
  0.176711 seconds (1.14 M allocations: 49.500 MB, 6.34% gc time)
GordStephen commented 7 years ago

Ok, while merge is far from optimized right now (my new project for this weekend!), keeping cat around as a fast, bare-bones combination operation for well-behaved data seems compelling.