SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Performance Shift(s): `bbe75acd` #4845

Closed trexfeathers closed 2 years ago

trexfeathers commented 2 years ago

Raising on behalf of the GHA workflow, which was having troubles at the time this originally came up!


Benchmark comparison has identified performance shifts at commit `bbe75acd`.

Please review the report below and take corrective/congratulatory action as appropriate :slightly_smiling_face:

Performance shift report:

```
       before           after         ratio
     [0efd78d9]       [bbe75acd]
+     20.203125      54.70703125     2.71  experimental.ugrid.regions_combine.CombineRegionsComputeRealData.track_addedmem_compute_data(500)
```
trexfeathers commented 2 years ago

Note to self: this memory shift is within one Dask chunk size; run/write a scalability benchmark to work out if it's a problem for larger Cubes.
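For context, a hedged sketch of what such a scalability benchmark could look like under ASV's conventions (a `track_*` method returning a number, run once per entry in `params`). The names and the plain dask array are illustrative stand-ins, not the actual iris sperf benchmark, and the RSS measurement is only a rough approximation of the benchmark utility actually used:

```python
import resource  # Unix-only; rough stand-in for the RSS-based benchmark utility

import dask.array as da
import numpy as np


class ScalabilitySketch:
    """Illustrative ASV-style scalability benchmark (NOT the actual iris sperf
    benchmark): ASV runs each "track_*" method once per "params" entry and
    records the number it returns."""

    params = [100, 200, 300, 500, 1000, 1668]
    param_names = ["cubesphere_edge"]

    def setup(self, n_edge):
        # Stand-in for the recombined cube's lazy data: 6 cubesphere panels.
        self.lazy_data = da.ones((6, n_edge, n_edge), dtype=np.float64)

    def track_addedmem_compute_data(self, n_edge):
        # Crude peak-RSS delta around realising the data (ru_maxrss is reported
        # in KiB on Linux); the real measurement utility may differ.
        before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        self.lazy_data.compute()
        after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        return (after - before) / 1024.0  # Mb
```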

trexfeathers commented 2 years ago

Bad news! The regression still exists at scale. Will need to investigate the Dask and NumPy change logs.

```
nox --session="benchmarks(custom)" -- run 0efd78d9^..bbe75acd --attribute rounds=1 --show-stderr --bench=sperf.combine_regions.ComputeRealData.track_addedmem_compute_data
```

```
nox --session="benchmarks(custom)" -- compare 0efd78d9 bbe75acd

       before           after         ratio
     [0efd78d9]       [bbe75acd]
              5.0              5.0     1.00  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(100)
+     32.17578125     175.20703125     5.45  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1000)
+        225.4375      599.8046875     2.66  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)
+             5.0        6.5546875     1.31  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(200)
+      5.86328125      18.01953125     3.07  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(300)
+       20.359375      54.48046875     2.68  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(500)
```

```
nox --session="benchmarks(custom)" -- publish '0efd78d9^..bbe75acd'
```

[image: ASV publish chart of the benchmark results]
pp-mo commented 2 years ago

I guess the key observation here is `225.4375 599.8046875 2.66 sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)`

However, the dask chunksize is 200Mb, and in my previous experience I think it typically uses about 3 chunks even when chunked operation is working correctly. The expected full data size of a re-combined cube in this case "ought" to be 6 * 1668 * 1668 * 4 bytes (for float, not double, dtype), so that is only ~64Mb per data array. So I think this isn't quite a smoking gun, yet.

It could well be useful (again, in my experience!) to test with a lower chunksize; it's usually then pretty clear whether it is generally blowing memory or not. This way is handy: `with dask.config.set({"array.chunk-size": "10Mb"}): ...` or whatever. Note the string method of specifying sizes!
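For illustration, a minimal sketch of applying that suggestion (a plain dask array stands in for the cube's lazy data; the key detail is that the chunk-size setting only influences arrays created, or auto-rechunked, while it is active):

```python
import dask
import dask.array as da
import numpy as np

# Create the lazy array inside the config context so that "auto" chunking
# respects the reduced chunk-size, then inspect the chunks it chose.
with dask.config.set({"array.chunk-size": "10MiB"}):
    lazy = da.ones((6, 1668, 1668), dtype=np.float32)
    print(lazy.chunks)            # several small chunks instead of one ~64Mb block
    total = lazy.sum().compute()  # compute with the smaller chunks in force
```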

trexfeathers commented 2 years ago

> It could well be useful (again, in my experience!) to test with a lower chunksize

@pp-mo what difference would we be anticipating?

trexfeathers commented 2 years ago

The Dask changes seem very unlikely to be the cause, as they are the kind of things you would expect in a bugfix release. Unless anyone else thinks differently?

pp-mo commented 2 years ago

> It could well be useful (again, in my experience!) to test with a lower chunksize
>
> @pp-mo what difference would we be anticipating?

Well, hopefully, that the total memory cost would reduce, as a multiple of the chunksize. I.e. ultimately we expect it to be the minimum of (N * chunk-size) and (M * data-array-size), so that for sufficient data size (and small enough chunk size) the memory reaches a plateau.

Within the "N" factor is also how many workers it can run in parallel (we expect threads in this case).
So you can either stop it doing that (go to synchronous, or restrict n-workers), or rely on knowledge of the platform.
Another practical learning point : if you allow parallel operation at all, I wouldn't use less than 4 workers. It doesn't seem to cope well with that.
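A sketch of those two options (illustrative only; a plain dask array stands in for the cube's lazy data, and the `num_workers` setting is an assumption about how the local threaded scheduler is configured):

```python
import dask
import dask.array as da
import numpy as np

# Stand-in lazy data, in place of the recombined cube's core data.
lazy = da.ones((6, 1000, 1000), dtype=np.float64)

# 1. Remove parallelism altogether, so memory reflects a single worker.
with dask.config.set(scheduler="synchronous"):
    result_serial = lazy.sum().compute()

# 2. Keep the threaded scheduler but pin the worker count ("num_workers" is
#    assumed here to be honoured by the local threaded scheduler).
with dask.config.set(scheduler="threads", num_workers=4):
    result_threaded = lazy.sum().compute()
```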

trexfeathers commented 2 years ago

Changing the chunk size to 10MiB considerably reduced the memory used for commit 0efd78d9, but made no difference to bbe75acd. The bbe75acd results are also suspiciously close to the estimated full sizes of those meshes, suggesting the entire thing is being realised. Now to work out why...

pp-mo commented 2 years ago

Updates + observations:

  1. Unfortunately, I was basically wrong about chunk sizes, since the test uses a cube of dims "(1, n_mesh_faces)", and the mesh dim cannot be chunked in the combine_regions calculations, which is like a regrid in that respect.

  2. Testing with a param (cubesphere-edge-number) of 1000, we get 1M points per face --> 6M points per cube --> 48 Mbytes (6 million points * 8 bytes, since data is float64), which is all in one dask chunk (see previous).

  3. I created a really simple test to run outside of ASV, like this:

    # Assumed imports from the iris benchmarks package (paths are a best guess):
    from benchmarks import TrackAddedMemoryAllocation
    from benchmarks.sperf.combine_regions import ComputeRealData

    tst = ComputeRealData()
    param = 1000
    tst.params = [param]
    # tst.setup_cache()   # only needed first time
    tst.setup(param)
    with TrackAddedMemoryAllocation() as mb:
        tst.recombined_cube.data   # realise the lazy data
    print('Cube data size = ', tst.recombined_cube.core_data().nbytes * 1.0e-6)
    print('Operation measured Mb = ', mb.addedmem_mb())

    When initially run, this claimed that no memory at all was used (!). I think that is because the setup runs in the same process as the to-be-measured calculation, whereas (also I think) ASV runs each test in its own fork, hence the RSS-based measure works there. (Actually, I'm increasingly concerned that this memory measurement method is not great, and a tracemalloc approach would be a better choice anyway.)

So I replaced the measurement with one based on tracemalloc, and got these results:

BEFORE the lockfile change

Cube data size =  48.0
Operation measured Mb =  97.321608543396

AFTER the lockfile change

Cube data size =  48.0
Operation measured Mb =  228.9274206161499
  4. Took a copy of the lockfile env + upgraded numpy from 1.22.4 to 1.23.0. Results are as the latest above, i.e. it is the numpy change that was significant (and not dask).

  5. Note that the memory numbers above:

     • BEFORE ~= data-size * 2
     • AFTER ~= data-size * 5

     The latter is interesting because it is NOT * 7 (there are 7 regions combined), so the problem is not "just" new space used up by each individual region-combine op.
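For reference, a tracemalloc-based tracker along the lines described above might look like this (a minimal sketch with illustrative names, not the code actually used):

```python
import tracemalloc


class TracemallocAddedMemory:
    """Context manager measuring the peak Python memory allocated inside the
    block, via tracemalloc (an illustrative replacement for the RSS-based
    TrackAddedMemoryAllocation)."""

    def __enter__(self):
        tracemalloc.start()
        self._start, _ = tracemalloc.get_traced_memory()
        return self

    def __exit__(self, *exc):
        _, peak = tracemalloc.get_traced_memory()
        self._added_bytes = peak - self._start
        tracemalloc.stop()

    def addedmem_mb(self):
        return self._added_bytes * 1.0e-6


# Usage, mirroring the test above:
#     with TracemallocAddedMemory() as mb:
#         tst.recombined_cube.data
#     print('Operation measured Mb = ', mb.addedmem_mb())
```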
trexfeathers commented 2 years ago

Conclusion - won't fix

This regression is a moderate hindrance at worst, and is therefore only worth a limited amount of investigation.

More detail: we continue to discover new possible avenues of investigation, without any promising end in sight. The worst expected memory demand is <700MiB, for a C1668 cubesphere horizontal slice (which must be a single chunk); chunking will prevent larger sizes. @pp-mo may wish to add to this.

I'm writing a cautionary What's New entry to notify users of the increased memory demand.

pp-mo commented 2 years ago

> worst expected is <700MiB

A bit more detail: From investigating the effects of chunking and different numpy versions, we found that the total memory cost of the operation is something like "2 * main-data-size + (2 or 3?) * combined-region-data-size".

The region data can be chunked, but the main data chunks must have a full mesh dimension, or indexing errors will occur (which needs fixing in the code - see below).

We've now shown that smaller chunks can reduce the cost due to region-data handling (but not the main data).
The conclusion is that it might effectively be copying/storing all region data, perhaps x2-3, but this is only about the same as the main data.
In particular, it is not costing "n-regions * full-data-size". From that it seems that the additional cost comes in management of the region data, and not the central map_blocks operation, since in that case it would scale with n-regions.

Hence, we believe the total cost is limited to about 3 * full-data-array-size, which is certainly larger, but not prohibitive. As this operation already can't chunk over the mesh dimension, we can accept costs of this magnitude. (Though we also want to be quite sure that we can still chunk the operation within outer dimensions; see below.)

pp-mo commented 2 years ago

TODO in follow-on fixes/changes: see #4882 and #4883