ZedThree opened this issue 3 years ago
I did implement creating-in-memory in #132. I thought it was being used in most places (there's a save-to-disk argument to the input-creation functions that defaults to not writing). It's possible we need to switch more of the tests over. But I think the combining-into-a-single-Dataset operation may be relatively expensive. The data is created with a lot of small variations on the inputs in different tests or when parameterizing. It might still be worth caching the created datasets, and maybe trying to reduce the number of variations, but it's probably quite a faff to do!
I had an idea for caching the input data, to avoid re-creating it. Working on it now, PR hopefully coming soon...
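Not from the thread itself, just a rough sketch of what that caching could look like: `functools.lru_cache` keys the generated Dataset on its creation parameters, and each test gets a copy rather than the cached object. The `get_input_ds` name and the toy builder are invented stand-ins for the real input-creation functions.

```python
import functools

import numpy as np
import xarray as xr


@functools.lru_cache(maxsize=None)
def _cached_input_ds(nx: int, ny: int, nz: int) -> xr.Dataset:
    # Toy stand-in for the real input-creation function: build a small
    # fake Dataset in memory, once per unique parameter combination.
    rng = np.random.default_rng(seed=0)
    return xr.Dataset({"n": (("x", "y", "z"), rng.standard_normal((nx, ny, nz)))})


def get_input_ds(nx: int = 16, ny: int = 16, nz: int = 8) -> xr.Dataset:
    # Return a shallow copy so tests that add or rename variables don't
    # modify the cached Dataset (use deep=True if a test mutates data in place).
    return _cached_input_ds(nx, ny, nz).copy()
```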
Looking at how long different tests take, the biggest chunks of time are the parallel interpolation tests (which you'd expect to be computationally heavy because they have four FFT transforms as well as the interpolation stencil operation), and the Region tests and TestPlot::test_region_disconnecteddoublenull, which have a lot of different cases to cover. It might be worth someone taking a look at those two sets in particular: are there more cases than we really need? Could the tests be done with smaller arrays without compromising anything? For TestPlot::test_region_disconnecteddoublenull specifically, it might be worth splitting it into one test that includes all the plot functions currently covered for one or a few combinations of input parameters, and a second that only uses one plot function (say pcolormesh) to test all the parameters.
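Purely as an illustration of that split (the fixture and the scale/offset parameters below are invented placeholders, and the example plots a plain DataArray with xarray's own plot methods rather than xBOUT's):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the tests don't need a display

import numpy as np
import pytest
import xarray as xr


@pytest.fixture
def small_da():
    # Tiny stand-in for the real generated BOUT++ test data.
    return xr.DataArray(np.random.rand(8, 8), dims=("x", "y"))


# First test: every plot function, but only a couple of representative
# input variations ("scale" is a placeholder for the real parameters).
@pytest.mark.parametrize("method", ["pcolormesh", "contourf", "imshow"])
@pytest.mark.parametrize("scale", [1.0, 10.0])
def test_all_plot_methods_few_params(small_da, method, scale):
    getattr((scale * small_da).plot, method)()


# Second test: a single plot function (pcolormesh) across the full
# parameter sweep.
@pytest.mark.parametrize("scale", [0.1, 1.0, 10.0, 100.0])
@pytest.mark.parametrize("offset", [0.0, 1.0])
def test_pcolormesh_all_params(small_da, scale, offset):
    (scale * small_da + offset).plot.pcolormesh()
```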
xarray 2022.10 causes a further slowdown of the tests: deep copying seems to have gotten about 20x slower.
See, e.g., https://github.com/boutproject/xBOUT/pull/252#issuecomment-1282672540
> For TestPlot::test_region_disconnecteddoublenull specifically, it might be worth splitting into one test that includes all the plot functions currently included for one or a few combinations of input parameters, and a second that only uses one plot function (say pcolormesh) to test all the parameters.
Sounds good
The deep copying is more expensive because it's now copying DataArrays in attributes, and the Regions have two or three nested levels of this. So ways of removing that nesting are probably worth investigating. Could the Regions make use of views somehow, perhaps?
hmm... wonder if we actually need to deep-copy at all? I suspect the idea at the time was that it would be safer, but maybe it's not needed...
It seems that prior to 2022.10, deep copying didn't deep copy the attributes (and therefore regions?), so presumably a shallow copy should be fine now?
I switched some deep copies to shallow ones and it did make a substantial difference to the one test I've been running, but it's still about 10x slower than before.
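For reference, the change is essentially just the deep flag on xarray's copy; a minimal sketch (not the actual xBOUT diff):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"n": (("x", "y"), np.random.rand(100, 100))})

deep = ds.copy(deep=True)      # duplicates the data (and, from xarray
                               # 2022.10, also deep-copies the attrs)
shallow = ds.copy(deep=False)  # new Dataset object, but the variables
                               # share the same underlying numpy arrays

assert shallow["n"].data is ds["n"].data
assert deep["n"].data is not ds["n"].data
```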
Can I ask why xBOUT stores arrays of data in the attrs?
Whilst xarray's .attrs can technically store arbitrary objects, it's intended for storing small metadata (e.g. names of units, provenance, explanation of what variables represent, etc.). Putting entire xr.DataArrays in the .attrs is not an intended use at all. If xBOUT didn't put actual data arrays in the attrs then the difference between deep- or shallow-copying metadata wouldn't matter.
The DataArrays in attrs were in the 'regions'. They were actually just a few scalars, so it probably makes sense to switch to storing their .values, i.e. to have them as just numbers.
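A rough illustration of that change, with an invented attribute name rather than one of the real Region fields:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"n": (("x", "y"), np.zeros((4, 4)))})

# Before: a zero-dimensional DataArray stored as metadata; from
# xarray 2022.10 this gets deep-copied whenever the parent is copied.
ds.attrs["xpoint_height"] = xr.DataArray(1.23)  # invented attribute name

# After: keep only the plain scalar value.
ds.attrs["xpoint_height"] = ds.attrs["xpoint_height"].values.item()
```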
This change is a 4-liner, and substantially reduces the number of errors we get from xarray-2022.9.0. I'll make a draft PR as a starting point for fixing #275.
The unit tests take really quite a long time to run. I used the pytest profiling extension, and it looks like creating and opening datasets takes the vast majority of the time.
Is there a lot of reading/writing to disk? Would it be possible to just do things in memory?
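For context, a minimal sketch of the in-memory approach, with an optional save-to-disk flag like the one described earlier in the thread; the function and argument names are illustrative rather than xBOUT's actual ones:

```python
import numpy as np
import xarray as xr


def make_fake_inputs(nx=16, ny=16, save_to_disk=False, path="fake_inputs.nc"):
    # Illustrative only: build the test inputs directly as an in-memory
    # Dataset, and only touch the filesystem when a test specifically
    # needs to exercise the file-reading path.
    ds = xr.Dataset({"n": (("x", "y"), np.random.rand(nx, ny))})
    if save_to_disk:
        ds.to_netcdf(path)
    return ds
```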