SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
626 stars 283 forks source link

Minimise the cube API #3429

Closed pp-mo closed 1 year ago

pp-mo commented 4 years ago

In my opinion, there is far too much in cube.py, and particularly the Cube class.

I would like to see this minimised by removing purely "convenience" instance methods : By this, I mean things which don't rely (?much) on private class implementation details, so they could just as well be cast as independent operations, which would take the cube as an argument. As 'utility' functions, these would then exist at the 'cube' package level or even in the generic iris.utils package.

In my mind, this means making a distinction between core operations which define what a cube "is" and are heavily concerned with the implementation, like add_dim_coord and __getitem__, and those which are secondary operations that just "use" cubes. Certainly, if an operation can be rewritten to work entirely via the public Cube API, then it does not belong in the class. Surprisingly, even functions like 'transpose' may fall into this category.

Specifically, I think all of the following could potentially be moved outside of the Cube class:

There is another set which could also be addressed, but which are currently in-place operations (see also #3430). If they were rewritten as functions which always return a new cube, then these could also be "taken out" :

There are some interesting parallels here with the numpy API. For instance you could remove all those array instance methods, like array.astype, array.reshape, array.transpose etc, when you can instead just use np.astype(array, X) etc. Most of those do simply redirect to static methods. I have also seen a comment on the numpy project that they "wish" they had reduced the proliferation of unnecessary instance methods, but it is now too late !

@njsmith If we could go back in time then I'd probably argue for not having these methods on ndarray at all, and using functions exclusively ... Too bad we don't have a time machine :-)

pp-mo commented 4 years ago

NOTE: I've labelled this for Iris 3.0, because it suggests a serious breaking change. Of course, we won't be doing this in Iris 3. Maybe I should start up an Iris 4.0 milestone ??

rcomer commented 4 years ago

:fearful:

I think I can see where you're coming from, but from a pure user perspective this might just look like "making us change our code for no good reason". I note that the comment you link specifically says that removing the numpy.ndarray methods is "impossible because it would break everyone's existing code that uses methods".

One big difference between the numpy and iris APIs is that all the numpy function are available in the top level package, so I am only ever 3 keystrokes (np.) away from the function I want to use. Iris has a proliferation of modules, making it less obvious where to find things. Having methods attached to the cube (and cubelist) makes them easy to find!

rcomer commented 4 years ago

Note that is has also previously been suggested that there is too much in iris.util.

pp-mo commented 4 years ago

I think I can see where you're coming from

TBH this is all about making the code tidier, which is basically for developers, not users !

A counter-example though is how hard it is to navigate the iris.cube docspage That is pretty bloated, is it not ?!?

pp-mo commented 4 years ago

Though, re:

proliferation of modules

I do enjoy a good multilevel hierachy, don't you ? :wink:

re:

there is too much in iris.util.

But that is really the same problem + I think supports my case, because that is just what's wrong with iris.util : it has no structure + is just a random collection of stuff.

rcomer commented 4 years ago

I do enjoy a good multilevel hierachy, don't you ? 😉

I couldn’t possibly comment 😆

I think it’s fair to say that a lot of functionality is not where you would put it if you were starting from scratch. But if we’re going to move things around, we need to consider that an awful lot of code has been written on top of Iris, so it should be as easy as possible to update said code to work with new versions. Perhaps consider creating tools to automate this? Like how 2to3 automatically changes next methods to functions.

rcomer commented 4 years ago

Or we could borrow the time machine from the numpy folks....

rcomer commented 4 years ago

NOTE: I've labelled this for Iris 3.0, because it suggests a serious breaking change.

BTW, are we still doing the deprecation thing? https://scitools.org.uk/iris/docs/latest/developers_guide/deprecations.html

pp-mo commented 4 years ago

are we still doing the deprecation thing?

Well, I still think so. But I also fear that we have often been over-cautious in the past :fearful: , when rigid rules have made it unreasonably hard to make simple changes for the better

For specific instance : did this issue today introduce a breaking change, and will that actually matter to anyone ? It's a judgement call.

rcomer commented 4 years ago

did this issue today introduce a breaking change

that link isn't working for me 😕

pp-mo commented 4 years ago

link isn't working

Sorry, I fixed it, meant to link #3431.

The form [link text](#issue-num) does not work !

pp-mo commented 4 years ago

borrow the time machine from the numpy folks

:grin: :chicken: :timer_clock: :gear: :hourglass_flowing_sand: :gear: :hatching_chick:

vsherratt commented 3 years ago

Can I just raise that this is a perfectly valid way to define a method:

def collapsed(self, coords, aggregator, **kwargs):
    pass  # actual contents of Cube.collapsed here...

class Cube:
    collapsed = collapsed

In fact, so is this (unless you've done something particularly egregious with metaclasses):

class Cube:
    pass

Cube.collapsed = collapsed

Basically there's nothing special about a def block inside a class block, or about naming a function argument self; the "magic" behind methods, ie automatically passing the caller instance in as the first argument, is inherent to the actual function type.

I think moving functions away from the class definition and into dedicated modules is a good idea regardless, 4000 lines is far too much for one file. So really I'd say the choices are:

1) Completely drop these functions from the Cube API, instead exposing them as functions across various modules - probably mostly iris.cube.* submodules, but I suppose some functions may fit better in other existing modules. As a breaking change, this would force a major version, and ideally deprecation warnings well beforehand.

2) Attach the functions as Cube methods, without putting the functions themselves in the public API. The change would purely be for developer convenience in dealing with more manageable files.

3) Expose both function and method options, similar to numpy. This is contentious. Certainly I wouldn't support writing a lot of one-line wrapper methods - that would be terrible for maintainability - but reusing the same function object in both situations is a strong option.

On one hand, it does break the "one obvious way to do something" guideline, and muddies the Cube namespace. Documentation could be more confusing, because sphinx will find and render exactly the same docstring in both cases; it will generate an appropriate function signature, but the hand-written argument lists and descriptions would either mention a cube argument that isn't there, or not mention one that is. Potentially addressable with a decorator that cleverly alters the docstring on a copy of the function, or just being more careful when writing them.

On the other, it doesn't force the user into either "always use methods" or "always use functions", nor force them to check all the time which to use because the package itself is inconsistent - they can choose whichever they prefer or fits in best with the rest of their project. Ambivalence on this choice of styles is arguably a good thing and a reasonable exception to the guideline, especially for numpy as such a core package, but just as applicable to iris.

A sub-option is to only do this temporarily, to provide an easier transition into the first option.