Open pp-mo opened 6 days ago
from @stephenworsley
import iris
from iris.cube import CubeList
from iris.util import new_axis
files = [
"/net/data/users/hadmm/time_varying_orog_example/cx209a.p42163dec.pp",
"/net/data/users/hadmm/time_varying_orog_example/cx209a.p42164jan.pp"
]
raw_cubes = iris.load(files, constraints="x_wind")
# raw_cubes = iris.load_raw(files, constraints="x_wind")
print(raw_cubes)
processed_cubes = CubeList(
[new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) for cube in raw_cubes]
)
print(processed_cubes)
result = processed_cubes.concatenate_cube()
print(result)
print(result.coord("altitude"))
0: x_wind / (m s-1) (model_level_number: 85; latitude: 144; longitude: 192)
1: x_wind / (m s-1) (model_level_number: 85; latitude: 144; longitude: 192)
0: x_wind / (m s-1) (time: 1; model_level_number: 85; latitude: 144; longitude: 192)
1: x_wind / (m s-1) (time: 1; model_level_number: 85; latitude: 144; longitude: 192)
x_wind / (m s-1) (time: 2; model_level_number: 85; latitude: 144; longitude: 192)
Dimension coordinates:
time x - - -
model_level_number - x - -
latitude - - x -
longitude - - - x
Auxiliary coordinates:
forecast_period x - - -
surface_altitude x - x x
level_height - x - -
sigma - x - -
Derived coordinates:
altitude x x x x
Scalar coordinates:
forecast_reference_time 2006-01-01 00:00:00
Cell methods:
0 time: mean (interval: 1 hour)
Attributes:
STASH m01s00i002
source 'Data from Met Office Unified Model'
um_version '12.1'
AuxCoord : altitude / (m)
points: <lazy>
bounds: <lazy>
shape: (2, 85, 144, 192) bounds(2, 85, 144, 192, 2)
dtype: float32
standard_name: 'altitude'
attributes:
positive 'up'
@stephenworsley It looks like derived coordinates are being handled by new_axis and concatenate does the result look correct to you (Matt) ?
(Matt) Looks good to me -- interesting to see that the altitude data is also lazy
@pp-mo I tried adding a third month (2063oct + 2063dec + 2064 jan). Then you get one cube (oct + dec) with 2 timepoints, and 1 cube with a scalar time (jan). So, I applied a transform like
if not cube.coord_dims('time'):
cube = new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period"))
But the problem is, the 2-timepoint cube already has a time dimension, but the surface_altitude naturally does not map to it. So you can't concatenate them after all. Those cubes cubes like (before new_axis processing)...
...
Cube # 1
x_wind / (m s-1) (latitude: 144; longitude: 192)
Dimension coordinates:
latitude x -
longitude - x
Auxiliary coordinates:
surface_altitude x x
Derived coordinates:
altitude x x
Scalar coordinates:
forecast_period 1365480.0 hours, bound=(1365120.0, 1365840.0) hours
forecast_reference_time 2006-01-01 00:00:00
level_height 10.000004 m, bound=(0.0, 19.999998) m
model_level_number 1
sigma 0.9988703, bound=(1.0, 0.9977413)
time 2164-01-16 00:00:00, bound=(2164-01-01 00:00:00, 2164-02-01 00:00:00)
so After new-axis processing
Cube # 0
x_wind / (m s-1) (time: 2; latitude: 144; longitude: 192)
Dimension coordinates:
time x - -
latitude - x -
longitude - - x
Auxiliary coordinates:
forecast_period x - -
surface_altitude - x x
Derived coordinates:
altitude - x x
Scalar coordinates:
forecast_reference_time 2006-01-01 00:00:00
level_height 10.000004 m, bound=(0.0, 19.999998) m
model_level_number 1
sigma 0.9988703, bound=(1.0, 0.9977413)
...
Cube # 1
x_wind / (m s-1) (time: 1; latitude: 144; longitude: 192)
Dimension coordinates:
time x - -
latitude - x -
longitude - - x
Auxiliary coordinates:
forecast_period x - -
surface_altitude x x x
Derived coordinates:
altitude x x x
Scalar coordinates:
forecast_reference_time 2006-01-01 00:00:00
level_height 10.000004 m, bound=(0.0, 19.999998) m
model_level_number 1
sigma 0.9988703, bound=(1.0, 0.9977413)
So, those won't concatenate even though they do have a common time coordinate, because the surface_altitudes have different dimensions.
It does work if you load_raw, process as above, then merge, then concatenate.
But it is rather nasty, and possibly slow.
I think it would be preferable to use a plain 'load', and rework the results in some way. But basically I think you will need to re-make the factory somehow to do that.
Would it be sensible to load each file separately and then process and concatenate? Or does the way loading work make this a lot more inefficient?
== load_raw with separate merge+concatenate steps ...
import iris
from iris.cube import CubeList
from iris.util import new_axis
files = [
"/net/data/users/hadmm/time_varying_orog_example/cx209a.p42163oct.pp",
"/net/data/users/hadmm/time_varying_orog_example/cx209a.p42163dec.pp",
"/net/data/users/hadmm/time_varying_orog_example/cx209a.p42164jan.pp"
]
raw_cubes = iris.load_raw(files, constraints="x_wind")
print("RAW CUBES (", len(raw_cubes), ")")
print("[:2]...")
print(raw_cubes[:2])
print(' ...{-2:]')
print(raw_cubes[-2:])
print('raw_cubes[0]')
print(raw_cubes[0])
processed_cubes = CubeList(
[new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) for cube in raw_cubes]
)
print("PROCESSED CUBES (", len(processed_cubes), ")")
print("[:2]...")
print(processed_cubes[:2])
print(' ...{-2:]')
print(processed_cubes[-2:])
print('processed_cubes[0]')
print(processed_cubes[0])
# Merge to sort out the model levels etc
merged_cubes = processed_cubes.merge()
print("MERGED CUBES (", len(merged_cubes), ")")
print(merged_cubes)
print('merged_cubes[0]')
print(merged_cubes[0])
# Finally concatenate to sort the time coords
result = merged_cubes.concatenate_cube()
print()
print("SINGLE RESULT CUBE")
print(result)
print()
print("result.coord('altitude')")
print(result.coord("altitude"))
RAW CUBES ( 255 )
[:2]...
0: x_wind / (m s-1) (latitude: 144; longitude: 192)
1: x_wind / (m s-1) (latitude: 144; longitude: 192)
...{-2:]
0: x_wind / (m s-1) (latitude: 144; longitude: 192)
1: x_wind / (m s-1) (latitude: 144; longitude: 192)
raw_cubes[0]
x_wind / (m s-1) (latitude: 144; longitude: 192)
Dimension coordinates:
latitude x -
longitude - x
Auxiliary coordinates:
surface_altitude x x
Derived coordinates:
altitude x x
Scalar coordinates:
forecast_period 1364760.0 hours, bound=(1364400.0, 1365120.0) hours
forecast_reference_time 2006-01-01 00:00:00
level_height 10.000004 m, bound=(0.0, 19.999998) m
model_level_number 1
sigma 0.9988703, bound=(1.0, 0.9977413)
time 2163-12-16 00:00:00, bound=(2163-12-01 00:00:00, 2164-01-01 00:00:00)
Cell methods:
0 time: mean (interval: 1 hour)
Attributes:
STASH m01s00i002
source 'Data from Met Office Unified Model'
um_version '12.1'
PROCESSED CUBES ( 255 )
[:2]...
0: x_wind / (m s-1) (time: 1; latitude: 144; longitude: 192)
1: x_wind / (m s-1) (time: 1; latitude: 144; longitude: 192)
...{-2:]
0: x_wind / (m s-1) (time: 1; latitude: 144; longitude: 192)
1: x_wind / (m s-1) (time: 1; latitude: 144; longitude: 192)
processed_cubes[0]
x_wind / (m s-1) (time: 1; latitude: 144; longitude: 192)
Dimension coordinates:
time x - -
latitude - x -
longitude - - x
Auxiliary coordinates:
forecast_period x - -
surface_altitude x x x
Derived coordinates:
altitude x x x
Scalar coordinates:
forecast_reference_time 2006-01-01 00:00:00
level_height 10.000004 m, bound=(0.0, 19.999998) m
model_level_number 1
sigma 0.9988703, bound=(1.0, 0.9977413)
Cell methods:
0 time: mean (interval: 1 hour)
Attributes:
STASH m01s00i002
source 'Data from Met Office Unified Model'
um_version '12.1'
MERGED CUBES ( 3 )
0: x_wind / (m s-1) (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
1: x_wind / (m s-1) (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
2: x_wind / (m s-1) (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
merged_cubes[0]
x_wind / (m s-1) (model_level_number: 85; time: 1; latitude: 144; longitude: 192)
Dimension coordinates:
model_level_number x - - -
time - x - -
latitude - - x -
longitude - - - x
Auxiliary coordinates:
level_height x - - -
sigma x - - -
forecast_period - x - -
surface_altitude - x x x
Derived coordinates:
altitude x x x x
Scalar coordinates:
forecast_reference_time 2006-01-01 00:00:00
Cell methods:
0 time: mean (interval: 1 hour)
Attributes:
STASH m01s00i002
source 'Data from Met Office Unified Model'
um_version '12.1'
SINGLE RESULT CUBE
x_wind / (m s-1) (model_level_number: 85; time: 3; latitude: 144; longitude: 192)
Dimension coordinates:
model_level_number x - - -
time - x - -
latitude - - x -
longitude - - - x
Auxiliary coordinates:
level_height x - - -
sigma x - - -
forecast_period - x - -
surface_altitude - x x x
Derived coordinates:
altitude x x x x
Scalar coordinates:
forecast_reference_time 2006-01-01 00:00:00
Cell methods:
0 time: mean (interval: 1 hour)
Attributes:
STASH m01s00i002
source 'Data from Met Office Unified Model'
um_version '12.1'
result.coord('altitude')
AuxCoord : altitude / (m)
points: <lazy>
bounds: <lazy>
shape: (85, 3, 144, 192) bounds(85, 3, 144, 192, 2)
dtype: float32
standard_name: 'altitude'
attributes:
positive 'up'
Merging and concattenating can happen in any order and give the same result, however, it appears that performance is heavily impacted when this happens.
Loading 60 of the above files:
newaxis
takes ~20 secondsNote: when running against the current, unreleased version of iris, the performance of merge and concatenate is about the same for these files.
It should be noted that in this case merging first combines cubes from within the same file and concattenating first involves combining cubes from differerent files. It's unclear if this would
An alternate approach where each file is individually merged first, then newaxis is applied, then concatenate is applied, took ~40 seconds in total to run (including loading). This approach is faster, but relies upon the structure of the files, so may be less general. had a significant effect on performance.
from pathlib import Path
import iris
from iris.cube import CubeList
from iris.util import new_axis
path = Path("PATH_TO_FILES")
files = list(path.glob("cx209a.*.pp"))
cubes_by_file = CubeList([iris.load_cube(file, constraint="x_wind") for file in files])
processed_cubes = CubeList(
[new_axis(cube, scalar_coord="time", expand_extras=("surface_altitude", "forecast_period")) for cube in cubes_by_file]
)
result = processed_cubes.concatenate_cube()
To account for differences in file structure with respect to the model_level_number
and time
dimension, I reconstructed the original data into two collections of files, in one of which each file contained fields with constant time
, in the other each file contained fields with constant model_level_number
. The length of time
and model_level_number
in both of these collections was 10.
In both cases, merging first proved more efficient than concatenating first, at ~1 second to ~5 seconds (when running the latest version of iris this is closer to ~1 second vs ~2 seconds). It's worth noting that the above approach of merging each file in turn would have been impossible in the case where each file had constant model_level_number
.
Investigating existing loading code, found that multiple factory references (like orography) are already merged on load. With a tweak, this can be extended to add the 'new_axis' call within the load chain See #6168
This effectively replicates the workaround solution, except the results still need a 'concatenate' after the merge.
However, this approach also seems to point to a fairly neat way to do auto-detection. See this line and note: this part is already in existing code
Investigating existing loading code, found that multiple factory references (like orography) are already merged on load. ... See #6168
Now updated with prototype loading-control object
Please take a look @stephenworsley @trexfeathers and see if you like the approach. TODO: no proper tests yet, and work to make load_cubes/load_cube do the same
Not really an issue, but a place to record+discuss potential solutions to #5369
Initial approach is code for workaround, but given #6163 can hopefully develop into optional or automatic operation within iris load.
In that context, appropriate API for this will depend on the scope and possible side-effects of the eventual solution. I.E. it might either be fully automatic or involve FUTURE controls or additional enabling/disabling keywords.