Note to self: this memory shift is within one Dask chunk size; run/write a scalability benchmark to work out whether it's a problem for larger Cubes.
Bad news! The regression still exists at scale. Will need to investigate the Dask and NumPy change logs.
```
nox --session="benchmarks(custom)" -- run 0efd78d9^..bbe75acd --attribute rounds=1 --show-stderr --bench=sperf.combine_regions.ComputeRealData.track_addedmem_compute_data
nox --session="benchmarks(custom)" -- compare 0efd78d9 bbe75acd
```
```
       before           after         ratio
     [0efd78d9]       [bbe75acd]
         5.0              5.0          1.00  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(100)
+       32.17578125     175.20703125   5.45  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1000)
+      225.4375         599.8046875    2.66  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)
+        5.0              6.5546875    1.31  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(200)
+        5.86328125      18.01953125   3.07  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(300)
+       20.359375        54.48046875   2.68  sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(500)
```
```
nox --session="benchmarks(custom)" -- publish '0efd78d9^..bbe75acd'
```
I guess the key observation here is:
```
225.4375 599.8046875 2.66 sperf.combine_regions.ComputeRealData.track_addedmem_compute_data(1668)
```
However, the Dask chunksize is 200Mb, and in my previous experience I think it typically uses about 3 chunks even when chunked operation is working correctly. The expected full data size of a re-combined cube in this case "ought" to be 6 × 1668 × 1668 × 4 bytes (for float, not double, dtype), so that is only ~64Mb per data array. So I think this isn't quite a smoking gun, yet.
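For reference, the arithmetic behind that size estimate (my own working, matching the figures quoted above):

```python
# Expected full data size of the recombined C1668 cube, assuming float32
# (4 bytes per value) as stated above.
n_values = 6 * 1668 * 1668       # six cubesphere faces of 1668 x 1668 cells
size_bytes = n_values * 4
print(size_bytes / 1e6)          # ~66.8 (MB)
print(size_bytes / 2**20)        # ~63.7 (MiB), i.e. the "~64Mb" quoted above
```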
It could well be useful (again, in my experience!) to test with a lower chunksize; it's usually then pretty clear whether it is generally blowing memory or not.
This way is handy:
```python
import dask

with dask.config.set({"array.chunk-size": "10MiB"}):
    ...
```
or whatever size you like: note that sizes can be specified as strings!
> It could well be useful (again, in my experience!) to test with a lower chunksize
@pp-mo what difference would we be anticipating?
The Dask changes seem very unlikely to be the cause, as they are the kind of things you would expect in a bugfix release. Unless anyone else thinks differently?
> It could well be useful (again, in my experience!) to test with a lower chunksize
> @pp-mo what difference would we be anticipating?
Well, hopefully, that the total memory cost would reduce, as a multiple of the chunksize. I.e. ultimately we expect it to be a minimum of (N × chunk-size) and (M × data-array-size), so that for sufficient data size (and small enough chunksize) the memory reaches a plateau.
Within the "N" factor is also how many workers it can run in parallel (we expect threads in this case).
So you can either stop it doing that (go to synchronous, or restrict n-workers), or rely on knowledge of the platform.
Another practical learning point: if you allow parallel operation at all, I wouldn't use fewer than 4 workers. It doesn't seem to cope well with that.
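To make that concrete, here is a minimal sketch of how the scheduler and worker count can be pinned via Dask config (my illustration only, not code from the benchmarks; the `...` marks where the lazy computation would run):

```python
import dask

# Force fully synchronous execution in the main thread, so the "N" factor
# above no longer includes parallel workers.
with dask.config.set(scheduler="synchronous"):
    ...  # run the lazy computation here

# Or keep the threaded scheduler but cap the worker count (per the advice
# above, not fewer than 4).
with dask.config.set(scheduler="threads", num_workers=4):
    ...  # run the lazy computation here
```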
Changing the chunk size to 10MiB considerably reduced the memory used for commit 0efd78d9, but made no difference to bbe75acd. The bbe75acd results are also suspiciously close to the estimated full sizes of those meshes, suggesting the entire thing is being realised. Now to work out why...
Updates + observations..
Unfortunately, I was basically wrong about chunk sizes, since the test uses a cube of dims "(1, n_mesh_faces)", and the mesh dim cannot be chunked in the combine_regions calculations, which is like a regrid in that respect.
Testing with a param (cubesphere-edge-number) of 1000, we get 1M points per face --> 6M points per cube --> 48 Mbytes (since data is float64), which is all in one Dask chunk (see previous).
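Spelled out (my own arithmetic, matching the "Cube data size = 48.0" figure reported further down):

```python
# C1000 cubesphere: 1000 x 1000 points per face, 6 faces, float64 (8 bytes).
print(6 * 1000 * 1000 * 8 / 1e6)  # -> 48.0 (MB), all within a single Dask chunk
```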
I created a really simple test to run outside of ASV, like this:
```python
# ComputeRealData is the ASV benchmark class from sperf.combine_regions;
# TrackAddedMemoryAllocation is the benchmark suite's RSS-based
# memory-measurement context manager.
tst = ComputeRealData()
param = 1000
tst.params = [param]
# tst.setup_cache()  # only needed first time
tst.setup(param)
with TrackAddedMemoryAllocation() as mb:
    tst.recombined_cube.data  # realise the lazy data
print('Cube data size = ', tst.recombined_cube.core_data().nbytes * 1.0e-6)
print('Operation measured Mb = ', mb.addedmem_mb())
```
When initially run, this claimed that no memory at all was used (!). I think that is because the setup is run in the same process as the to-be-measured calc, whereas (also I think) ASV runs each test in its own fork, hence the RSS-based measure then works. (Actually, I'm increasingly concerned that this memory measurement method is not great, and a tracemalloc approach would be a better choice anyway.)
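A tracemalloc-based measurement could look something like this (my sketch only, not the actual replacement code used; it records the peak traced Python allocation rather than the change in process RSS):

```python
import tracemalloc


class TracemallocAddedMemory:
    """Context manager reporting peak Python memory allocated within the block."""

    def __enter__(self):
        tracemalloc.start()
        return self

    def __exit__(self, *exc):
        # get_traced_memory() returns (current, peak) sizes in bytes.
        _, self._peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return False

    def addedmem_mb(self):
        # Same reporting convention as the RSS-based measure: megabytes.
        return self._peak_bytes * 1.0e-6
```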
So I replaced the measurement with one based on tracemalloc, and got these results:

BEFORE the lockfile change:
```
Cube data size =  48.0
Operation measured Mb =  97.321608543396
```
AFTER the lockfile change:
```
Cube data size =  48.0
Operation measured Mb =  228.9274206161499
```
I took a copy of the lockfile env and upgraded numpy from 1.22.4 to 1.23.0. The results are as the latest above, i.e. it is the numpy change that was significant (and not Dask).
> note that the memory numbers above
This regression is a moderate hindrance at worst, and is therefore only worth a limited amount of investigation.
More detail: we continue to discover new possible avenues of investigation, without any promising end in sight. The worst expected memory demand is <700MiB, for a C1668 cubesphere horizontal slice (which must be a single chunk); chunking will prevent larger sizes. @pp-mo may wish to add to this.
I'm writing a cautionary What's New entry to notify users of the increased memory demand.
> worst expected is <700MiB
A bit more detail: from investigating the effects of chunking and different numpy versions, we found that the total memory cost of the operation is something like "2 × main-data-size + (2 or 3?) × combined-region-data-size".
The region data can be chunked, but the main data chunks must have a full mesh dimension, or indexing errors will occur (which needs fixing in the code - see below).
We've now shown that smaller chunks can reduce the cost due to region-data handling (but not the main data).
The conclusion is that it might effectively be copying/storing all the region data, perhaps ×2-3, but that is only about the same as the main data.
In particular, it is not costing n-regions × full-data-size.
From that it seems that the additional cost comes in the management of the region data, and not in the central map_blocks operation, as in that case it would scale as n-regions.
Hence, we believe the total cost is limited to about 3 × full-data-array-size, which is certainly larger, but not prohibitive. As this operation already can't chunk over the mesh dimension, we can accept costs of this magnitude (though we also want to be quite sure that we can still chunk the operation within outer dimensions; see below).
TODO in follow-on fixes/changes: see #4882 and #4883
Raising on behalf of the GHA workflow, which was having troubles at the time this originally came up!
Benchmark comparison has identified performance shifts at
Please review the report below and take corrective/congratulatory action as appropriate :slightly_smiling_face:
Performance shift report
```
       before           after         ratio
     [0efd78d9]       [bbe75acd]
+       20.203125       54.70703125    2.71  experimental.ugrid.regions_combine.CombineRegionsComputeRealData.track_addedmem_compute_data(500)
```