SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Unify merge and concatenate #3234

Open kaedonkers opened 5 years ago

kaedonkers commented 5 years ago

There have been many comments through the years about combining the function of merge and concatenate to reduce the pain of turning many cubes into one. The main argument is that, while each function is doing a distinctly different thing, it should not matter from a user's point of view whether the cubes are merging from a scalar coordinate into a new dimension, or concatenating cubes onto an existing common dimension.

This issue acts as a placeholder to bring these issues together and assign this change to milestone Iris v3.0. The (potentially) related issues are listed below:

Please edit this list as appropriate.

kaedonkers commented 5 years ago

Ping @dkillick

bjlittle commented 5 years ago

@kaedonkers Just seeking clarification here... Are you proposing that merge and concatenate are combined in some way? Or are you just capturing all the issues regarding merge and concatenate in one place?

DPeterK commented 5 years ago

@bjlittle I'd love to see merge and concatenate become the same operation. Making a distinction between them is somewhat arbitrary, confusing for Iris users and perhaps primarily there to make life easy for devs and not users. Also merge is slow compared to concatenate, so if we could pull the functionality of concatenate in line with merge then we could remove merge and run everything through concatenate instead.

bjlittle commented 5 years ago

@dkillick I think the issue is that you can get different results depending on the order in which you apply them, i.e. merge + concatenate vs concatenate + merge, or some sort of recursive combination of the two.

That's why we originally left it to the user to decide, rather than iris doing something magical.

Keeping them separate also feeds into the custom pipeline approach of processing data...

I'm not against bringing merge + concatenate together, but we'd need to think about that carefully.

Concatenate doesn't create new dimensions, it only works with existing dimensions, whereas merge will create new dimensionality in the cube (just making the point, rather than teaching you to suck :egg: 's)
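To make that distinction concrete, here is a minimal sketch using plain Python lists as stand-ins for cube data. The helper names `merge_like` and `concatenate_like` are hypothetical and are not part of the Iris API; they only mimic the shape behaviour described above.

```python
# Toy stand-ins for cube data: each "cube" is a 1-D list of values.
# Nothing here is Iris API; it only illustrates the shape semantics.

def merge_like(a, b):
    """Stack two equally shaped payloads along a NEW leading dimension."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same shape to be stacked")
    return [list(a), list(b)]


def concatenate_like(a, b):
    """Join two payloads end to end along their EXISTING dimension."""
    return list(a) + list(b)


first = [10, 11, 12]
second = [20, 21, 22]

merged = merge_like(first, second)        # 2 x 3: dimensionality grows
joined = concatenate_like(first, second)  # length 6: dimensionality unchanged
```

From the user's side both calls turn two inputs into one result, which is the argument for hiding the difference behind a single operation.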

Otherwise, we could have concatenate with automatic scalar coordinate self-promotion to a new dimension, which might do the trick.... I had an experimental branch that did this years ago, but got shot down at the time. Just thought I'd resurrect that notion as food for thought... :thinking:

rcomer commented 5 years ago

Would concatenate be able to cope with examples like #2761?

DPeterK commented 5 years ago

@rcomer as it stands, no, because I don't think concatenate can handle scalar anything. The intention here is that if we unified merge and concatenate then the single resultant thing would be able to handle scalars (and get right what merge currently isn't, as per your example).

bjlittle commented 5 years ago

@kaedonkers @dkillick @rcomer See here... couldn't resist 😉

bjlittle commented 5 years ago

So here's one possible unification proposal... note that, I'm initially aiming to build on what we have-ish, without a massive rewrite, although I'm not discounting that as a possibility.

Let's assume that the current recursive stratagem of concatenate is reasonably sound as a foundation from which to move forward, glossing over the obvious warts. So for me, the two big questions at this point are:

  1. how do we extend the dimensionality?
  2. how do we seed the dimensionality order efficiently?

Up front, I'm going to suppress the temptation to bias the design with any thoughts regarding optimisation through parallelism or sparse constructions of hypercubes. Let's keep it reasonably simple. For me, that's always a good place to start...

So going back to my comment above, I've previously toyed with the concept of concatenate automatically promoting a scalar coordinate to be a new singleton dimension. The naive brute force approach inherent within concatenate would then automatically promote like scalar coordinates on candidate cubes, in order to concatenate over the newly promoted dimension.
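A rough sketch of that promotion idea, assuming a toy dictionary representation of a cube (the `scalars`/`dims`/`data` layout and the functions `promote` and `concatenate_leading` are invented for illustration and do not reflect how Iris stores or concatenates cubes):

```python
def promote(cube, name):
    """Turn scalar coordinate `name` into a length-1 leading dimension."""
    scalars = dict(cube["scalars"])
    value = scalars.pop(name)
    return {
        "dims": [(name, [value])] + cube["dims"],
        "scalars": scalars,
        "data": [cube["data"]],  # wrap the payload in a new singleton axis
    }


def concatenate_leading(cubes):
    """Concatenate cubes along their shared leading dimension."""
    first = cubes[0]
    name = first["dims"][0][0]
    for cube in cubes:
        if cube["dims"][0][0] != name or cube["dims"][1:] != first["dims"][1:]:
            raise ValueError("cubes do not share a common dimensionality")
        if cube["scalars"] != first["scalars"]:
            raise ValueError("cubes differ in their remaining scalar coordinates")
    # Order the pieces by the first point on the leading dimension.
    ordered = sorted(cubes, key=lambda c: c["dims"][0][1][0])
    points, data = [], []
    for cube in ordered:
        points.extend(cube["dims"][0][1])
        data.extend(cube["data"])
    return {
        "dims": [(name, points)] + first["dims"][1:],
        "scalars": dict(first["scalars"]),
        "data": data,
    }


# Three candidate cubes that differ only by the scalar coordinate "time".
cubes = [
    {"scalars": {"time": t}, "dims": [("x", [0, 1, 2])],
     "data": [t * 10 + i for i in range(3)]}
    for t in (1, 0, 2)
]
result = concatenate_leading([promote(c, "time") for c in cubes])
```

Here `result` carries a new leading `time` dimension over the existing `x` dimension, which is the outcome a merge would normally be expected to produce for such simple inputs.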

That's fine and dandy, but this only works for simple candidate cubes that differ by the promoted scalar coordinate. This simple approach doesn't work for a simple merge example such as:

A = [1, 1, 2, 2]
B = [3, 4, 3, 4]

Anyways, scalar coordinate promotion as a mechanism within concatenate addresses question [1.]

The interesting part is addressing question [2.]... and the leap of faith here is borrowing/leaning on the dimensionality smarts of merge to then seed the scalar coordinate promotion of concatenate in order to efficiently direct it to the desired hypercube dimensionality. However, the main point here is that the current concatenation logic needs to deal appropriately with duplicate scalar values on the newly promoted dimension, i.e. A = [1, 1, 2, 2] becomes A = [1, 2], whilst ignoring the yet-to-be-promoted scalar coordinates, i.e. B = [3, 4, 3, 4], on the candidate cubes.

Given that, we then simply rinse and repeat, by promoting scalar coordinate B, which is the degenerate case for concatenate. Voilà?
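The rinse-and-repeat idea can be sketched on the A/B example as follows. This is only a toy illustration under stated assumptions: cubes are plain dictionaries, the promotion order `["A", "B"]` is supplied by hand (standing in for whatever merge's dimensionality logic would seed), and the function `unify` is hypothetical rather than anything in Iris.

```python
def unify(cubes, names):
    """Promote scalar coordinates in `names` order into a nested hypercube.

    Returns (points_per_name, nested_data). Duplicate values on the
    coordinate being promoted are collapsed, while the coordinates still
    to be promoted are ignored until their turn.
    """
    if not cubes:
        raise ValueError("no candidate cubes")
    if not names:
        if len(cubes) != 1:
            raise ValueError("more than one cube for the same combination")
        return {}, cubes[0]["data"]
    name, rest = names[0], names[1:]
    # De-duplicate: A = [1, 1, 2, 2] becomes A = [1, 2].
    points = sorted({c["scalars"][name] for c in cubes})
    sub_points, data = None, []
    for value in points:
        group = [c for c in cubes if c["scalars"][name] == value]
        group_points, group_data = unify(group, rest)  # rinse and repeat
        if sub_points is not None and group_points != sub_points:
            raise ValueError("cubes do not form a complete hypercube")
        sub_points = group_points
        data.append(group_data)
    return {name: points, **sub_points}, data


A = [1, 1, 2, 2]
B = [3, 4, 3, 4]
cubes = [{"scalars": {"A": a, "B": b}, "data": a * 10 + b} for a, b in zip(A, B)]

coords, data = unify(cubes, ["A", "B"])
```

With these inputs `coords` holds one set of unique points per promoted coordinate and `data` is a 2 x 2 nesting of the four payloads. The sketch simply raises an error when the candidates do not fill the hypercube, which is one of the details a real implementation would have to handle more gracefully.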

The devil is most definitely in the detail, but there may be mileage with this approach... then again, maybe not. But it does (to me) suggest there may be a ray of hope for possible unification...

rcomer commented 5 years ago

Looks like #2592 and #512 are sub-issues of #1987.