SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Poor performance of iris with latest dask #4736

Open jamesp opened 2 years ago

jamesp commented 2 years ago

πŸ› Bug Report

Reading netcdf and pp files in iris with the latest version of dask is slower and takes a lot more memory.

I'm not familiar enough with the mechanics of iris loading to understand whether this is a dask issue or an iris issue. I suspect it may be dask; if so, I'd appreciate any help in pinpointing the specific thing I need to raise a bug report with dask about.

The attached script creates a 4D cube full of random numbers and saves it to PP and netCDF. It then loads the file and performs the following operations:

load&extract: constrain to a single time and latitude when loading, then take the longitude mean:

```python
cube = iris.load_cube(file, t_const & lat_const)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
coll.data  # realise the data
```

load,extract: take the longitude mean, then constrain to a single time and latitude:

```python
cube = iris.load_cube(file)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
ext = coll.extract(t_const & lat_const)
ext.data  # realise the data
```

Both should end up with identical results (a quick consistency check is sketched after the tables). Here are the time taken and peak memory usage for iris 3.2 on my machine:

netCDF file:

| operation | dask 2021.6 | dask 2022.2 |
| --- | --- | --- |
| load&extract | 0.15 s, 1 MB | 0.36 s, 204 MB |
| load,extract | 0.3 s, 204 MB | 0.34 s, 204 MB |

PP file:

| operation | dask 2021.6 | dask 2022.2 |
| --- | --- | --- |
| load&extract | 5.18 s, 13 MB | 5.76 s, 5 MB |
| load,extract | 1.81 s, 14 MB | 2.16 s, 7 MB |
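A minimal sketch of that consistency check, reusing the `t_const` and `lat_const` constraints and the `test.nc` file created by the attached analysis script:

```python
import iris
import iris.analysis
import numpy as np

# Sketch only: both orderings of extract/collapse should give the same answer,
# up to floating-point rounding. t_const, lat_const and test.nc come from the
# attached analysis script.
a = iris.load_cube('test.nc', t_const & lat_const).collapsed('longitude', iris.analysis.MEAN)
b = iris.load_cube('test.nc').collapsed('longitude', iris.analysis.MEAN).extract(t_const & lat_const)
np.testing.assert_allclose(a.data, b.data)
```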

How To Reproduce

Steps to reproduce the behaviour:

  1. Download the file
  2. mamba create -y -c conda-forge -n iris-3.2-dask-2021.6 iris=3.2 dask=2021.6.0
  3. run script in environment.
  4. mamba create -y -c conda-forge -n iris-3.2-dask-2022.2 iris=3.2 dask=2022.2.0
  5. run script in environment.

Expected behaviour

Performance should be better in later versions of iris and dask.

Environment

Analysis script

Expand for full script to cut-paste:

```python
from contextlib import contextmanager
import sys
import tracemalloc
import time

import cf_units
import iris
import iris.cube
import numpy as np


def log(msg):
    print(msg, file=sys.stderr)


class PerfMon:
    """A Simple time and memory profiler."""

    def __init__(self):
        self.runs = []

    @contextmanager
    def __call__(self, tag: str):
        tracemalloc.start()
        start = time.perf_counter()
        yield
        stop = time.perf_counter()
        _, peak_mem = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        data = (stop - start, peak_mem / 1024**2)
        self.runs.append((tag, data))

    def results(self):
        print("{:<20s} {:>12s} {:>12s}".format("tag", "time (s)", "mem (mb)"))
        for tag, data in self.runs:
            print("{:20s} {:12.2f} {:12.0f}".format(tag, *data))


nt = 20
nz = 19
ny = 325
nx = 432

log('. generating data')
c = iris.cube.Cube(
    np.random.normal(size=(nt, nz, ny, nx)),
    dim_coords_and_dims=[
        (iris.coords.DimCoord(np.arange(nt), 'time',
                              units=cf_units.Unit('days since 1970-01-01 00:00:00',
                                                  calendar='gregorian')), 0),
        (iris.coords.DimCoord(np.arange(nz), 'height'), 1),
        (iris.coords.DimCoord(np.linspace(-90, 90, ny, endpoint=True), 'latitude'), 2),
        (iris.coords.DimCoord(np.linspace(0., 360, nx, endpoint=False), 'longitude'), 3),
    ],
    standard_name='atmosphere_relative_vorticity',
)

t_point = c.coord('time').cell(4)
t_const = iris.Constraint(time=lambda x: x == t_point)
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lambda x: x == lat_point)

log('. saving test files')
iris.save(c, 'test.pp')
iris.save(c, 'test.nc')

# do some very simple performance tests:
# 1. load the file
# 2. calculate the zonal mean
# 3. realise the data
#
# compare:
# - pp vs netcdf performance
# - extracting (t, lat) at load or after collapse

perf = PerfMon()

log('. perf test 1')
with perf('pp load&extract'):
    cube = iris.load_cube('test.pp', t_const & lat_const)
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll
    res.data

log('. perf test 2')
with perf('pp load,extract'):
    cube = iris.load_cube('test.pp')
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll.extract(t_const & lat_const)
    res.data

log('. perf test 3')
with perf('nc load&extract'):
    cube = iris.load_cube('test.nc', t_const & lat_const)
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll
    res.data

log('. perf test 4')
with perf('nc load,extract'):
    cube = iris.load_cube('test.nc')
    coll = cube.collapsed('longitude', iris.analysis.MEAN)
    res = coll.extract(t_const & lat_const)
    res.data

perf.results()
```
trexfeathers commented 2 years ago

> why is load then extract faster than load and extract, despite doing far more work?

@bjlittle explained to me last week that constraints during FF/PP load are applied once to each field, leaving many very small chunks of memory - suboptimal. By contrast, using extract allows Dask to load optimally sized memory chunks - many fields at once - and apply the constraint once per chunk.
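A rough way to see that difference directly is to inspect the dask array behind each cube; a sketch, assuming the `test.pp` file and the `t_const` / `lat_const` constraints from the reproduction script:

```python
import iris

# Sketch: compare the dask chunking produced by the two loading strategies.
# test.pp, t_const and lat_const come from the reproduction script.
constrained = iris.load_cube('test.pp', t_const & lat_const)
unconstrained = iris.load_cube('test.pp')

for name, cube in [("load&extract", constrained), ("load,extract", unconstrained)]:
    lazy = cube.lazy_data()  # the underlying dask array
    print(name, lazy.shape, lazy.chunksize, lazy.npartitions)
```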

trexfeathers commented 2 years ago

As for Dask: for me this is just more support for #4572 and similar. Dask's default chunk handling changes frequently, AFAICT based on the assumption that users can flexibly adapt the chunking in their scripts. As we know, Iris currently buries chunking control inside the loading automation, risking each Dask improvement becoming a 'breaking performance change'.

rcomer commented 2 years ago

That PP load&extract gets a lot quicker if you simplify the latitude constraint:

```python
lat_point = c.coord('latitude').cell(45)
lat_const = iris.Constraint(latitude=lat_point.point)
```

Possibly because of https://github.com/SciTools/iris/blob/caae45075795c1e13fbb0f04488c3b99f15b60e8/lib/iris/_constraints.py#L308-L315
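To see how much the constraint style matters, a hedged micro-benchmark sketch, reusing the cube `c`, `t_const` and `test.pp` from the reproduction script:

```python
import time
import iris

# Sketch: time a constrained PP load with a callable latitude constraint
# versus a plain-value one. c, t_const and test.pp come from the script.
lat_point = c.coord('latitude').cell(45)
lambda_const = iris.Constraint(latitude=lambda cell: cell == lat_point)
value_const = iris.Constraint(latitude=lat_point.point)

for name, const in [("lambda", lambda_const), ("value", value_const)]:
    start = time.perf_counter()
    iris.load_cube('test.pp', t_const & const)
    print(f"{name} constraint: {time.perf_counter() - start:.2f}s")
```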

wjbenfold commented 2 years ago
> - load&extract used to be lazy, now data is realised

@jamesp where are you seeing this? When I run your code and then ask whether the cube is lazy, it claims to be.

jamesp commented 2 years ago

I'm seeing it in the memory footprint recorded by tracemalloc, shown in the table above. The numbers suggest that a lot more of the cube is being realised at some point, or my crude method of measuring memory performance is wrong.

From our recent dash on MeshCoords (#4749) I understand a bit more about how the real/lazy mechanics of iris work: it seems very possible that I could realise some or all of the data in a cube and yet it would still claim to be lazy.
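A small check along those lines, sketched with `Cube.has_lazy_data()` and the names from the reproduction script:

```python
import iris
import iris.analysis

# Sketch: has_lazy_data() only reports whether the cube still wraps a dask
# array; it says nothing about how much memory was touched along the way.
# test.nc, t_const and lat_const come from the reproduction script.
cube = iris.load_cube('test.nc', t_const & lat_const)
coll = cube.collapsed('longitude', iris.analysis.MEAN)
print(cube.has_lazy_data(), coll.has_lazy_data())  # typically True, True
coll.data                                          # realise the result
print(coll.has_lazy_data())                        # now False
```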

wjbenfold commented 2 years ago

PP load with various environments, using the user-provided file that prompted @jamesp's concern:

```
tag                      time (s)     mem (mb)
pp load&extract user        44.44           15
pp load,extract user        16.90           56
```
(iris-3.2-dask-2021.6)

```
tag                      time (s)     mem (mb)
pp load&extract user        46.81            8
pp load,extract user        17.45           61
```
(iris-3.2-dask-2022.2)

```
tag                      time (s)     mem (mb)
pp load&extract user        26.55            8
pp load,extract user        16.84           71
```
(default-current)

```
tag                      time (s)     mem (mb)
pp load&extract user        30.28           14
pp load,extract user        18.32          143
```
(production-os45-1)

We also worked out earlier that the netCDF file isn't realising everything; it's just pulling the entire chunk into memory (i.e. with a 2GB file it still pulls in about 200MB, but doesn't pull the full 2GB).
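One way to put a number on "the entire chunk", sketched against the generated `test.nc` (the exact figure depends on the iris and dask versions):

```python
import numpy as np
import iris

# Sketch: estimate the in-memory size of one dask chunk of the lazily loaded
# netCDF cube from the reproduction script.
cube = iris.load_cube('test.nc')
lazy = cube.lazy_data()
chunk_mib = np.prod(lazy.chunksize) * lazy.dtype.itemsize / 1024**2
print(lazy.chunksize, f"~{chunk_mib:.0f} MiB per chunk")
```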

trexfeathers commented 2 years ago

> the netCDF file isn't realising everything, it's just pulling the entire chunk into memory

Oh that's fine then, it's only the worst of both worlds 😆

wjbenfold commented 2 years ago

> Oh that's fine then, it's only the worst of both worlds 😆

I don't know, I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright? (at least it doesn't pull 2GB into memory - I reckon that's a win! :p)

trexfeathers commented 2 years ago

Oh, you're right, I misinterpreted this - I'm too ready to hear "poor chunking means the whole file is loaded".

pp-mo commented 2 years ago

> the netCDF file isn't realising everything, it's just pulling the entire chunk into memory

> Oh that's fine then, it's only the worst of both worlds 😆

> I don't know, I thought it would have to pull things in at least at the chunk level to read them, so it seems kind of alright? (at least it doesn't pull 2GB into memory - I reckon that's a win! :p)

Well yes, of course it must load "some few" chunks at some time.
The chunk-size default is now configurable via Dask, where it defaults to 128 MiB; this code gets the target size from the Dask default:

```python
>>> import dask.config
>>> import dask.array
>>> dask.config.get("array.chunk-size")
'128MiB'
```

In practice, I've seen that you can expect about 3× this.
But I suspect it may not be deterministic how soon memory is released afterwards: it may behave somewhat like the Python garbage collector.
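If that default turns out to matter, it can be overridden before loading; a sketch (whether iris's netCDF chunking actually honours it depends on the iris version):

```python
import dask.config
import iris

# Sketch: lower dask's target chunk size before loading, then check what
# chunking the resulting cube ends up with.
with dask.config.set({"array.chunk-size": "32MiB"}):
    cube = iris.load_cube('test.nc')
print(cube.lazy_data().chunksize)
```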

github-actions[bot] commented 1 year ago

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days' time.