calliope-project / calliope

A multi-scale energy systems modelling framework
https://www.callio.pe
Apache License 2.0
286 stars 93 forks

Speed up get_formatted_array #170

Closed brynpickering closed 5 years ago

brynpickering commented 5 years ago

Problem description

get_formatted_array splits the loc::techs and loc::tech::carriers string sets and moves data between xarray and pandas to produce a sparsely populated matrix for easier indexing (e.g. summing over a single tech).

This can take a very long time for large DataArrays, and has been recorded as hitting memory limits for some devices.

So, it should be made more efficient. This could be a matter of defining loc::techs etc. as tuples instead of ::-concatenated strings. Then they are automagically parsed as a MultiIndex, instead of needing string operations to split them apart.
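A minimal sketch of the idea, with made-up location/tech names (region1, pv, etc. are not from Calliope itself): pandas parses a list of tuples directly into a MultiIndex, and xarray can then expand the levels into separate dimensions without any string handling.

```python
import pandas as pd
import xarray as xr

# Hypothetical loc_tech labels as tuples rather than
# "region1::pv"-style concatenated strings; pandas turns
# a list of tuples straight into a MultiIndex
idx = pd.MultiIndex.from_tuples(
    [("region1", "pv"), ("region1", "wind"), ("region2", "pv")],
    names=["locs", "techs"],
)
series = pd.Series([1.0, 2.0, 3.0], index=idx)

# from_series expands the MultiIndex levels into separate
# locs/techs dimensions, so per-tech operations need no string ops
da = xr.DataArray.from_series(series)
total_pv = float(da.sel(techs="pv").sum())  # sum over a single tech
```

Note this is just a sketch of the tuple-index idea, not Calliope's actual data layout.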

Calliope version

0.6.3

timtroendle commented 5 years ago

Here's one measurement: for my 502 locs, 12 techs, 2208 timesteps model, model.get_formatted_array("carrier_prod") took more than 60 GB of RAM and 45 minutes to run before I killed it just now (considering my computer has 16 GB of physical RAM, the runtime isn't to be taken seriously).

This is crashing runs on the cluster because I'm hitting memory limits. It's especially unfortunate as the model is already solved at that point in time. Furthermore, postprocessing now takes 1/3 or 1/4 of the entire computation time for my models (2-4h), a good part of which may be due to get_formatted_array. But more importantly, this seems to be setting the upper bound on the model size solvable on the cluster: with more efficient postprocessing, I could increase the problem size.

I will try and see whether I can improve the code. Using a MultiIndex in xarray sounds like a clean and proper fix, but possibly an excessive one.

The cleanest solution would probably be to avoid loc_techs altogether, but that isn't possible before sparse matrices are implemented in xarray: https://github.com/pydata/xarray/issues/1375.

timtroendle commented 5 years ago

Memo to myself:

The problematic line creates a DataArray with in my case 502x507x2208~5e8 entries. MultiIndex works like a charm though, and this may be the solution:

# data_var_df.index is a pandas MultiIndex with levels
# locs, techs and carriers, so .sel can target each level
updated_data_var = xr.DataArray(
    data_var_df.values,
    [("loctechscarriers", data_var_df.index), ("timesteps", data_var_df.columns)]
)
updated_data_var.sel(locs="my-loc")  # example use

It takes milliseconds to execute and hardly any RAM.

EDIT: That is, it's not the string formatting. No wonder, there are only about 1e3 strings to format in my case. Still, should all of this work, this could be a great general solution for avoiding loc_techs within Calliope.

brynpickering commented 5 years ago

@timtroendle, I actually just switched off postprocessing on the cluster in my runs, due to the same issue... Anyway, I had some stuff waiting to go on this, see PR #231 for a working branch that you could test with. It may still blow up on unstacking the MultiIndex (but my memory profiling suggests a much lower memory use than the previous incarnation of get_formatted_array).

I'm a bit confused by your solution: how does it go from indexing with ("loctechscarriers", data_var_df.index) to being able to select a location with updated_data_var.sel(locs="my-loc")? If it offers an even better solution, I'm happy to look at updating the PR in line with it.

timtroendle commented 5 years ago

@brynpickering I do not quite understand your solution and why it is so great, but it is so great! Memory and runtime are acceptable for my results: I think I saw 7 GB spikes and a runtime of a few seconds. Now it sits there with about 4 GB RAM.

My solution was using a MultiIndex without unstacking it, which leads to much, much smaller arrays: for data as sparse as mine, around three orders of magnitude smaller. You can select locations and technologies as I show above, but you cannot do things like da.sum(dim='locs'), because locs is no longer a free dimension (of course, that's the whole point of it).

I'll need to explore your solution a bit more, but for now it looks good for my issues at least. Maybe I'll use my own routine with MultiIndex as well, if I can work around the issues it has. And if that appears useful, we can think of adding it to the core, so that one can do:

model.get_formatted_array("sadas") # your approach
model.get_multiindexed_array("sadas") # multiindex approach

And should there be support for sparse matrices in xarray at one point, we can have the benefits of both approaches combined.

BTW, I didn't know you could switch off postprocessing -- was that a hack, or is there an option?

brynpickering commented 5 years ago

Great! Previously, it was turning the DataArray into a pandas dataframe, splitting the loc_tech index string, then turning it back into a DataArray. Now it is just turning the loc_tech index into a pandas index, doing a string operation to create a MultiIndex of locs and techs, replacing the loc_tech dimension with the new MultiIndex, and unstacking that MultiIndex. So, if you stopped this script just before that last step, you'd get the loc_tech (or loc_tech_carrier) index as a MultiIndex in the returned DataArray.
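The steps described above can be sketched as follows, on made-up data (the coordinate values and variable names are illustrative, not taken from PR #231): split the "loc::tech" strings once on the pandas index, attach the pieces as level coordinates, build a MultiIndex with set_index, then unstack.

```python
import pandas as pd
import xarray as xr

# Toy loc_tech-indexed array with "::"-concatenated labels
da = xr.DataArray(
    [1.0, 2.0, 3.0],
    coords={"loc_techs": ["region1::pv", "region1::wind", "region2::pv"]},
    dims="loc_techs",
)

# One string split on the index, not per-element work in xarray
locs, techs = zip(*(s.split("::") for s in da["loc_techs"].values))
stacked = da.assign_coords(
    locs=("loc_techs", list(locs)),
    techs=("loc_techs", list(techs)),
).set_index(loc_techs=["locs", "techs"])  # loc_techs is now a MultiIndex

# Stopping at `stacked` gives the MultiIndexed DataArray; unstacking
# yields the dense locs x techs form, NaN where no tech exists
formatted = stacked.unstack("loc_techs")
```

This mirrors the split / replace-dimension / unstack sequence in spirit only; the actual code in the PR is what counts.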

I'd be happy enough to have the option of returning just the MultiIndexed DataArray, given that the unstacking causes a small memory increase. Perhaps get_formatted_array could just include that as an optional argument?

BTW, I didn't know you could switch off postprocessing -- was that a hack, or is there an option?

Very much a hack...

timtroendle commented 5 years ago

I'd be happy enough to have the option of returning just the MultiIndexed DataArray, given that the unstacking causes a small memory increase. Perhaps get_formatted_array could just include that as an optional argument?

Good idea.

sjpfenninger commented 5 years ago

Some ideas in here for further improvements of how we deal with arrays.