SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
633 stars 283 forks source link

Code solutions for time-dependent hybrid height #6165

Open pp-mo opened 6 days ago

pp-mo commented 6 days ago

Not really an issue, but a place to record+discuss potential solutions to #5369

Initial approach is code for workaround, but given #6163 can hopefully develop into optional or automatic operation within iris load.

In that context, appropriate API for this will depend on the scope and possible side-effects of the eventual solution. I.E. it might either be fully automatic or involve FUTURE controls or additional enabling/disabling keywords.

pp-mo commented 6 days ago

Initial Code suggestion

from @stephenworsley

import iris
from iris.cube import CubeList
from iris.util import new_axis

files = [
    "/net/data/users/hadmm/time_varying_orog_example/cx209a.p42163dec.pp",
    "/net/data/users/hadmm/time_varying_orog_example/cx209a.p42164jan.pp"
]

raw_cubes = iris.load(files, constraints="x_wind")
# raw_cubes = iris.load_raw(files, constraints="x_wind")
print(raw_cubes)

processed_cubes = CubeList(
    [new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) for cube in raw_cubes]
)
print(processed_cubes)

result = processed_cubes.concatenate_cube()
print(result)
print(result.coord("altitude"))

Output

0: x_wind / (m s-1)                    (model_level_number: 85; latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (model_level_number: 85; latitude: 144; longitude: 192)
0: x_wind / (m s-1)                    (time: 1; model_level_number: 85; latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (time: 1; model_level_number: 85; latitude: 144; longitude: 192)
x_wind / (m s-1)                    (time: 2; model_level_number: 85; latitude: 144; longitude: 192)
    Dimension coordinates:
        time                             x                      -             -               -
        model_level_number               -                      x             -               -
        latitude                         -                      -             x               -
        longitude                        -                      -             -               x
    Auxiliary coordinates:
        forecast_period                  x                      -             -               -
        surface_altitude                 x                      -             x               x
        level_height                     -                      x             -               -
        sigma                            -                      x             -               -
    Derived coordinates:
        altitude                         x                      x             x               x
    Scalar coordinates:
        forecast_reference_time     2006-01-01 00:00:00
    Cell methods:
        0                           time: mean (interval: 1 hour)
    Attributes:
        STASH                       m01s00i002
        source                      'Data from Met Office Unified Model'
        um_version                  '12.1'
AuxCoord :  altitude / (m)
    points: <lazy>
    bounds: <lazy>
    shape: (2, 85, 144, 192)  bounds(2, 85, 144, 192, 2)
    dtype: float32
    standard_name: 'altitude'
    attributes:
        positive  'up'
pp-mo commented 6 days ago

Comments following above

@stephenworsley It looks like derived coordinates are being handled by new_axis and concatenate does the result look correct to you (Matt) ?

(Matt) Looks good to me -- interesting to see that the altitude data is also lazy

pp-mo commented 6 days ago

Further comments

@pp-mo I tried adding a third month (2063oct + 2063dec + 2064 jan). Then you get one cube (oct + dec) with 2 timepoints, and 1 cube with a scalar time (jan). So, I applied a transform like

if not cube.coord_dims('time'):
        cube = new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) 

But the problem is, the 2-timepoint cube already has a time dimension, but the surface_altitude naturally does not map to it. So you can't concatenate them after all. Those cubes cubes like (before new_axis processing)...

 ...
Cube # 1
x_wind / (m s-1)                    (latitude: 144; longitude: 192)
    Dimension coordinates:
        latitude                             x               -
        longitude                            -               x
    Auxiliary coordinates:
        surface_altitude                     x               x
    Derived coordinates:
        altitude                             x               x
    Scalar coordinates:
        forecast_period             1365480.0 hours, bound=(1365120.0, 1365840.0) hours
        forecast_reference_time     2006-01-01 00:00:00
        level_height                10.000004 m, bound=(0.0, 19.999998) m
        model_level_number          1
        sigma                       0.9988703, bound=(1.0, 0.9977413)
        time                        2164-01-16 00:00:00, bound=(2164-01-01 00:00:00, 2164-02-01 00:00:00)

so After new-axis processing

Cube # 0
x_wind / (m s-1)                    (time: 2; latitude: 144; longitude: 192)
    Dimension coordinates:
        time                             x            -               -
        latitude                         -            x               -
        longitude                        -            -               x
    Auxiliary coordinates:
        forecast_period                  x            -               -
        surface_altitude                 -            x               x
    Derived coordinates:
        altitude                         -            x               x
    Scalar coordinates:
        forecast_reference_time     2006-01-01 00:00:00
        level_height                10.000004 m, bound=(0.0, 19.999998) m
        model_level_number          1
        sigma                       0.9988703, bound=(1.0, 0.9977413)
...
Cube # 1
x_wind / (m s-1)                    (time: 1; latitude: 144; longitude: 192)
    Dimension coordinates:
        time                             x            -               -
        latitude                         -            x               -
        longitude                        -            -               x
    Auxiliary coordinates:
        forecast_period                  x            -               -
        surface_altitude                 x            x               x
    Derived coordinates:
        altitude                         x            x               x
    Scalar coordinates:
        forecast_reference_time     2006-01-01 00:00:00
        level_height                10.000004 m, bound=(0.0, 19.999998) m
        model_level_number          1
        sigma                       0.9988703, bound=(1.0, 0.9977413)

So, those won't concatenate even though they do have a common time coordinate, because the surface_altitudes have different dimensions. It does work if you load_raw, process as above, then merge, then concatenate.
But it is rather nasty, and possibly slow. I think it would be preferable to use a plain 'load', and rework the results in some way. But basically I think you will need to re-make the factory somehow to do that.

Reply from @stephenworsley

Would it be sensible to load each file separately and then process and concatenate? Or does the way loading work make this a lot more inefficient?

pp-mo commented 6 days ago

Here's the above "clunky solution"

== load_raw with separate merge+concatenate steps ...

import iris
from iris.cube import CubeList
from iris.util import new_axis

files = [
    "/net/data/users/hadmm/time_varying_orog_example/cx209a.p42163oct.pp",
    "/net/data/users/hadmm/time_varying_orog_example/cx209a.p42163dec.pp",
    "/net/data/users/hadmm/time_varying_orog_example/cx209a.p42164jan.pp"
]

raw_cubes = iris.load_raw(files, constraints="x_wind")
print("RAW CUBES (", len(raw_cubes), ")")
print("[:2]...")
print(raw_cubes[:2])
print('  ...{-2:]')
print(raw_cubes[-2:])
print('raw_cubes[0]')
print(raw_cubes[0])

processed_cubes = CubeList(
    [new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) for cube in raw_cubes]
)
print("PROCESSED CUBES (", len(processed_cubes), ")")
print("[:2]...")
print(processed_cubes[:2])
print('  ...{-2:]')
print(processed_cubes[-2:])
print('processed_cubes[0]')
print(processed_cubes[0])

# Merge to sort out the model levels etc
merged_cubes = processed_cubes.merge()
print("MERGED CUBES (", len(merged_cubes), ")")
print(merged_cubes)
print('merged_cubes[0]')
print(merged_cubes[0])

# Finally concatenate to sort the time coords
result = merged_cubes.concatenate_cube()
print()
print("SINGLE RESULT CUBE")
print(result)
print()
print("result.coord('altitude')")
print(result.coord("altitude"))

Results

RAW CUBES ( 255 )
[:2]...
0: x_wind / (m s-1)                    (latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (latitude: 144; longitude: 192)
  ...{-2:]
0: x_wind / (m s-1)                    (latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (latitude: 144; longitude: 192)
raw_cubes[0]
x_wind / (m s-1)                    (latitude: 144; longitude: 192)
    Dimension coordinates:
        latitude                             x               -
        longitude                            -               x
    Auxiliary coordinates:
        surface_altitude                     x               x
    Derived coordinates:
        altitude                             x               x
    Scalar coordinates:
        forecast_period             1364760.0 hours, bound=(1364400.0, 1365120.0) hours
        forecast_reference_time     2006-01-01 00:00:00
        level_height                10.000004 m, bound=(0.0, 19.999998) m
        model_level_number          1
        sigma                       0.9988703, bound=(1.0, 0.9977413)
        time                        2163-12-16 00:00:00, bound=(2163-12-01 00:00:00, 2164-01-01 00:00:00)
    Cell methods:
        0                           time: mean (interval: 1 hour)
    Attributes:
        STASH                       m01s00i002
        source                      'Data from Met Office Unified Model'
        um_version                  '12.1'
PROCESSED CUBES ( 255 )
[:2]...
0: x_wind / (m s-1)                    (time: 1; latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (time: 1; latitude: 144; longitude: 192)
  ...{-2:]
0: x_wind / (m s-1)                    (time: 1; latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (time: 1; latitude: 144; longitude: 192)
processed_cubes[0]
x_wind / (m s-1)                    (time: 1; latitude: 144; longitude: 192)
    Dimension coordinates:
        time                             x            -               -
        latitude                         -            x               -
        longitude                        -            -               x
    Auxiliary coordinates:
        forecast_period                  x            -               -
        surface_altitude                 x            x               x
    Derived coordinates:
        altitude                         x            x               x
    Scalar coordinates:
        forecast_reference_time     2006-01-01 00:00:00
        level_height                10.000004 m, bound=(0.0, 19.999998) m
        model_level_number          1
        sigma                       0.9988703, bound=(1.0, 0.9977413)
    Cell methods:
        0                           time: mean (interval: 1 hour)
    Attributes:
        STASH                       m01s00i002
        source                      'Data from Met Office Unified Model'
        um_version                  '12.1'
MERGED CUBES ( 3 )
0: x_wind / (m s-1)                    (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
1: x_wind / (m s-1)                    (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
2: x_wind / (m s-1)                    (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
merged_cubes[0]
x_wind / (m s-1)                    (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
    Dimension coordinates:
        model_level_number                             x         -            -               -
        time                                           -         x            -               -
        latitude                                       -         -            x               -
        longitude                                      -         -            -               x
    Auxiliary coordinates:
        level_height                                   x         -            -               -
        sigma                                          x         -            -               -
        forecast_period                                -         x            -               -
        surface_altitude                               -         x            x               x
    Derived coordinates:
        altitude                                       x         x            x               x
    Scalar coordinates:
        forecast_reference_time     2006-01-01 00:00:00
    Cell methods:
        0                           time: mean (interval: 1 hour)
    Attributes:
        STASH                       m01s00i002
        source                      'Data from Met Office Unified Model'
        um_version                  '12.1'

SINGLE RESULT CUBE
x_wind / (m s-1)                    (model_level_number: 85; time: 3; latitude: 144; longitude: 192)
    Dimension coordinates:
        model_level_number                             x         -            -               -
        time                                           -         x            -               -
        latitude                                       -         -            x               -
        longitude                                      -         -            -               x
    Auxiliary coordinates:
        level_height                                   x         -            -               -
        sigma                                          x         -            -               -
        forecast_period                                -         x            -               -
        surface_altitude                               -         x            x               x
    Derived coordinates:
        altitude                                       x         x            x               x
    Scalar coordinates:
        forecast_reference_time     2006-01-01 00:00:00
    Cell methods:
        0                           time: mean (interval: 1 hour)
    Attributes:
        STASH                       m01s00i002
        source                      'Data from Met Office Unified Model'
        um_version                  '12.1'

result.coord('altitude')
AuxCoord :  altitude / (m)
    points: <lazy>
    bounds: <lazy>
    shape: (85, 3, 144, 192)  bounds(85, 3, 144, 192, 2)
    dtype: float32
    standard_name: 'altitude'
    attributes:
        positive  'up'
stephenworsley commented 5 days ago

Performance

Merging and concattenating can happen in any order and give the same result, however, it appears that performance is heavily impacted when this happens.

Loading 60 of the above files:

Note: when running against the current, unreleased version of iris, the performance of merge and concatenate is about the same for these files.

It should be noted that in this case merging first combines cubes from within the same file and concattenating first involves combining cubes from differerent files. It's unclear if this would

An alternate approach where each file is individually merged first, then newaxis is applied, then concatenate is applied, took ~40 seconds in total to run (including loading). This approach is faster, but relies upon the structure of the files, so may be less general. had a significant effect on performance.


from pathlib import Path

import iris
from iris.cube import CubeList
from iris.util import new_axis

path = Path("PATH_TO_FILES")
files = list(path.glob("cx209a.*.pp"))

cubes_by_file = CubeList([iris.load_cube(file, constraint="x_wind") for file in files])

processed_cubes = CubeList(
    [new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) for cube in cubes_by_file]
)

result = processed_cubes.concatenate_cube()

Comparing different file structures

To account for differences in file structure with respect to the model_level_number and time dimension, I reconstructed the original data into two collections of files, in one of which each file contained fields with constant time, in the other each file contained fields with constant model_level_number. The length of time and model_level_number in both of these collections was 10.

In both cases, merging first proved more efficient than concatenating first, at ~1 second to ~5 seconds (when running the latest version of iris this is closer to ~1 second vs ~2 seconds). It's worth noting that the above approach of merging each file in turn would have been impossible in the case where each file had constant model_level_number.

pp-mo commented 4 days ago

Investigating existing loading code, found that multiple factory references (like orography) are already merged on load. With a tweak, this can be extended to add the 'new_axis' call within the load chain See #6168

This effectively replicates the workaround solution, except the results still need a 'concatenate' after the merge.

However, this approach also seems to point to a fairly neat way to do auto-detection. See this line and note: this part is already in existing code

pp-mo commented 2 days ago

Investigating existing loading code, found that multiple factory references (like orography) are already merged on load. ... See #6168

Now updated with prototype loading-control object

Please take a look @stephenworsley @trexfeathers and see if you like the approach. TODO: no proper tests yet, and work to make load_cubes/load_cube do the same