ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0
42 stars 38 forks source link

Feature request: Extract_surface preprocessor #1039

Open ledm opened 3 years ago

ledm commented 3 years ago

I'm back! Hope you didn't miss me too much!

I've in the process of trying to extract the oceans surface layer from from a 3D dataset and the current method seems REALLY slow:

    extract_levels:
      levels: [0., ]
      scheme: nearest

The extract_levels called some functions with the _regrid module, which then tries to so some regridding in the z axis for every time step and it is really slow.

All I want is for a preprocessor to slice off the top layer in the Z direction. It should be really quick to write, test, and run. No-regridding required!

ie:

def extract_surface(cube):
""" Returns surface layer of 3D plus time cube.
"""
if cube.data.ndim==4:
    return cube[:, 0, :, :]

Thoughts?

@valeriupredoi Is this better or worse than a whatsapp message? lol.

bettina-gier commented 3 years ago

For atmospheric variables, e.g. co2 we have a derived variable co2s (https://github.com/ESMValGroup/ESMValCore/blob/master/esmvalcore/preprocessor/_derive/co2s.py) which is derived from surface pressure to actually determine which level of the 3D array corresponds to the "surface" - necessary when considering mountains or other elevated surfaces. Thus I wouldn't call this preprocessor you propose "extract_surface" as it might be confusing for atmospheric variables where people think of geographical surface. Is it possible to extend the extract_levels preprocessor with the option of no regridding instead?

bouweandela commented 3 years ago

Hi @ledm, good to have you back! What you describe is already sort of implemented here: https://github.com/ESMValGroup/ESMValCore/blob/9f603be2d726e99fcf069ad747ddc8a73a24e89a/esmvalcore/preprocessor/_regrid.py#L544-L550 If you give the extract_levels preprocessor function levels that are available in the cube, it will construct a constraint that selects the relevant data from the cube. The only time it will do vertical interpolation is if you request a level that is not there.

ledm commented 3 years ago

The issue with this option @bouweandela is that every model will have a different surface level thickness, and therefore a difference cell depth for the surface layer. Indeed, some models will have a different surface layer cell depth for every lat/lon coordinate at every point in time, so providing a level to match the model is not a functional solution.

Neither is setting a derived variable, as there are lots of datasets that we're only interested in the surface - remember that the surface is the only part of the ocean visible to satellites, so that's where a lot of the observations are! Derived variables are a lot of work for not a lot of gain.

In an ideal case, if we can't build a dedicated extract_surface preprocessor, could we provide a key word argument to the extract_levels preprocessor at the recipe stage, something like this:

    extract_levels:
      levels: sea_surface

ie:

 elif levels == 'sea_surface';
    if cube.data.ndim==4:
        result = cube[:, 0, :, :]

(but smarter and more flexible).

bouweandela commented 3 years ago

If you cannot do it by providing a level, it's probably no good to use the extact_levels preprocessor function for this. Maybe you could do something like #74, if you want to go for smarter and more flexible than just plain cube slicing?

ledm commented 3 years ago

The whole problem of this is that the current extract_level method, even when using the "nearest" scheme tries to regrid the entire dataset in the z-direction. This is done in the stratify module, which isn't lazy and it realises the data, in my case that 120GB of memory. It's also absurdly slow. Any thoughts on this @valeriupredoi, you were working on something yesterday?

bouweandela commented 3 years ago

This is done in the stratify module, which isn't lazy and it realises the data

I'm planning to start working on making extract_levels lazy soon now, I'm still stuck writing developer documentation, but after that is finished I hope I will have time to fix that.

ledm commented 3 years ago

In that case, included a special case for the ocean surface would definitely be appreciated (and much lazier than regridding).

ledm commented 3 years ago

In either case, I've written a extract_surface preprocessor here:

https://github.com/ESMValGroup/ESMValCore/tree/extract_surface

There's no tests, or documentation. It's not particularly clever it, but it runs the entire analysis I need in minutes, instead of the previous extract_levels preprocessor which tried to regrid the entire dataset, and failed to complete after running overnight and took 120GB of RAM.

valeriupredoi commented 3 years ago

In either case, I've written a extract_surface preprocessor here:

https://github.com/ESMValGroup/ESMValCore/tree/extract_surface

There's no tests, or documentation. It's not particularly clever it, but it runs the entire analysis I need in minutes, instead of the previous extract_levels preprocessor which tried to regrid the entire dataset, and failed to complete after running overnight and took 120GB of RAM.

great work! Could you please open a pull request with that (after adding tests and documentation) when you're ready - you can mark the PR as Draft while you're working on it :beer:

ledm commented 3 years ago

Is it worth making the effort to refine the branch for including in the master though?

It sounded like there's not a lot of appetite for including. Even in the extract bottom layer discussion in #975, both @zklaus and I didn't think there'd be much use for it. Though I guess that was before we found extract_layer was as slow over long time series of 3D+time data.

bouweandela commented 3 years ago

Is it worth making the effort to refine the branch for including in the master though?

That's up to you to decide, you probably know best what is useful for people working with ocean data. You might want to include the word sea or ocean in the name, just to make it absolutely clear that it shouldn't be used with atmospheric data.

ledm commented 3 years ago

Thinking about it, why wouldn't you be able to use it in the atmosphere or the land though? Are there many atmospheric models that have land masks for relief?

This preprocessor looks for the minimum of the absolute value of the z-axis, so as long as the z axis is set to zero, it should really work for ocean and land surface and some atmospshere models. Not sure about ice though.

It might not work if the z axis is non-monotonic, or if there closest layer to zero is has some kind of land mask.

In either case, the code is ready for a review.

bettina-gier commented 3 years ago

The most common atmospheric z coordinate is pressure, so ground level would be the highest z value, and some models over land have values which I assume are extrapolated, as they would be inside of e.g. mountain ranges, below the actual land surface - that's why just looking for the first value that isn't missing doesn't work either.

ledm commented 3 years ago

Would it be worth making this preprocessor more generic, allowing it to specific an integer value for the index along the z axis?

zklaus commented 3 years ago

Would it be worth making this preprocessor more generic, allowing it to specific an integer value for the index along the z axis?

That sounds a bit dangerous. Wouldn't that run the serious risk of mixing rather different heights? How would you handle the coordinates in that case?

bouweandela commented 3 years ago

I also think it goes a bit against the philosophy of the tool: we try to provide reliable results by using the metadata that is available, so plain cube slicing should be a last resort solution.

ledm commented 3 years ago

Perhaps this should be instead re-written as a special case of the extract _layer preprocessor?

ledm commented 1 year ago

I also think it goes a bit against the philosophy of the tool: we try to provide reliable results by using the metadata that is available, so plain cube slicing should be a last resort solution.

Lol, looks like I've had to re-invent this particular wheel at least twice now!