calliope-project / calliope

A multi-scale energy systems modelling framework
https://www.callio.pe
Apache License 2.0

Use of native sparse array support in xarray / pandas / netCDF #634

Closed irm-codebase closed 1 week ago

irm-codebase commented 3 months ago

What can be improved?

Calliope should be more memory efficient! <- Finally this one is applicable

While checking how to better support sparsity in our ecosystem, I found out about sparse. It is quite literally focused on our common use case in ESOMs: very sparse data.

After some investigation, it seems like xarray has either rolled out, or is planning to roll out, support for this library: https://github.com/pydata/xarray/issues/3213. pandas also seems to have rolled out support.
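
For context, here is a minimal sketch of what that support looks like (assuming a recent pandas/xarray and the sparse package installed; names and values are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

# pandas: sparse extension arrays only store the non-fill values.
s = pd.Series([1.0, np.nan, np.nan, 4.0]).astype(pd.SparseDtype("float", fill_value=np.nan))
print(s.sparse.density)  # fraction of entries actually stored

# xarray: a DataArray backed by sparse.COO instead of a dense numpy array.
idx = pd.MultiIndex.from_tuples([("pv", "n1"), ("boiler", "n2")], names=["techs", "nodes"])
da = xr.DataArray.from_series(pd.Series([1.0, 2.0], index=idx), sparse=True)
print(type(da.data))  # sparse.COO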

I propose to evaluate how to integrate this into calliope, with two key design goals in mind:

Version

v0.7

brynpickering commented 3 months ago

We've investigated sparse arrays a few times in the past. The problem is the data types that can be handled sparsely. If it can now handle pyomo/gurobi objects then that's great!

irm-codebase commented 3 months ago

Can you elaborate on the incompatibility? The linked thread has MESSAGEix and PyPSA modellers, so it might be possible through some method. Here is another thread that goes in detail for xarray: https://github.com/pydata/xarray/issues/1375

brynpickering commented 2 months ago

I did some tests on this locally and it doesn't work out-of-the-box. Operations between sparse arrays don't work when the data types are objects, e.g.:

import calliope
import xarray as xr

m = calliope.examples.urban_scale()
m.build(force=True)

# you can't transform an existing array into sparse, so this is a quick hack.
foo = xr.DataArray.from_series(m.backend.variables.flow_cap.to_series(), sparse=True)
bar = xr.DataArray.from_series(m.backend.parameters.flow_cap_max.to_series(), sparse=True)

foo * bar

[Out] TypeError: Implicit conversion of Pyomo numeric value (parameters[flow_cap_max][12]*variables[flow_cap][0]) to float is disabled.
This error is often the result of using Pyomo components as arguments to
one of the Python built-in math module functions when defining
expressions. Avoid this error by using Pyomo-provided math functions or
explicitly resolving the numeric value using the Pyomo value() function.

I've also tried setting the fill value of foo and bar to non-NaN values and using np.multiply(foo, bar, dtype=np.object_). These all end up creating dtype issues in the sparse library, e.g.:

File ~/miniforge3/envs/calliope/lib/python3.12/site-packages/sparse/_umath.py:542, in _Elemwise._get_fill_value(self)
    540 # Store dtype separately if needed.
    541 if self.dtype is not None:
--> 542     fill_value = fill_value.astype(self.dtype)
    544 self.fill_value = fill_value
    545 self.dtype = self.fill_value.dtype

AttributeError: 'float' object has no attribute 'astype'
File ~/miniforge3/envs/calliope/lib/python3.12/site-packages/sparse/_umath.py:524, in _Elemwise._get_fill_value(self)
    521     fill_value_array = self.func(*np.broadcast_arrays(*zero_args), **self.kwargs)
    523 try:
--> 524     fill_value = fill_value_array[(0,) * fill_value_array.ndim]
    525 except IndexError:
    526     zero_args = tuple(
    527         arg.fill_value if isinstance(arg, COO) else _zero_of_dtype(arg.dtype) for arg in self.args
    528     )

AttributeError: 'float' object has no attribute 'ndim'
File ~/miniforge3/envs/calliope/lib/python3.12/site-packages/sparse/_umath.py:545, in _Elemwise._get_fill_value(self)
    542     fill_value = fill_value.astype(self.dtype)
    544 self.fill_value = fill_value
--> 545 self.dtype = self.fill_value.dtype

AttributeError: 'float' object has no attribute 'dtype'

It works fine for operations on purely numeric data, just not when working with object arrays.
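
For comparison, here's a minimal numeric sketch of the case that does work (illustrative values, not taken from the model):

import pandas as pd
import xarray as xr

idx = pd.MultiIndex.from_tuples(
    [("pv", "n1"), ("boiler", "n2")], names=["techs", "nodes"]
)
# Both arrays are float-typed and backed by sparse.COO.
cap = xr.DataArray.from_series(pd.Series([10.0, 5.0], index=idx), sparse=True)
cost = xr.DataArray.from_series(pd.Series([2.0, 3.0], index=idx), sparse=True)

result = cap * cost  # element-wise multiplication works for numeric dtypes
print(type(result.data))  # still a sparse array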

irm-codebase commented 2 months ago

@brynpickering after messing around with the backend, I see what you mean... this one will be tough.

The big issue here is that np.nan consumes the same amount of memory as any other float (64 bits), and we currently have NaN entries for things like carriers (i.e., adding a new carrier adds NaNs for every case with that dim). So, adding a single heat technology to an electricity-only model may double the memory use of some parts.
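
To put a rough number on that, a back-of-the-envelope sketch with the sparse package (shapes and density are made up):

import numpy as np
import sparse

# Dense: every NaN placeholder still occupies 8 bytes.
dense = np.full((100, 50, 8760), np.nan)  # e.g. techs x nodes x timesteps
dense[:10, :5, :] = 1.0  # only ~1% of the entries are actually defined
print(dense.nbytes / 1e6)  # ~350 MB, regardless of how many entries are NaN

# Sparse COO with NaN as the fill value: only the defined entries are stored.
coo = sparse.COO.from_numpy(dense, fill_value=np.nan)
print(coo.nbytes / 1e6)  # roughly proportional to the number of non-NaN entries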

Pyomo does support sparsity natively... but I do not yet know the backend well enough to know how/what needs to change.
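
(For illustration, a minimal sketch of what Pyomo's native sparsity looks like: indexed components declared over a set that only contains the valid combinations. This is not necessarily how the Calliope backend builds its components.)

import pyomo.environ as pyo

m = pyo.ConcreteModel()
# Only the valid (tech, node) pairs exist; invalid combinations are never declared.
valid_pairs = [("pv", "n1"), ("boiler", "n2")]
m.pairs = pyo.Set(initialize=valid_pairs, dimen=2)
m.flow_cap = pyo.Var(m.pairs, within=pyo.NonNegativeReals)
print(len(m.flow_cap))  # 2, not len(techs) * len(nodes)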

Algorithmically, a solution should be possible if we have full determinism when flattening matrices, i.e., we always know the following:

Not sure if we currently have this, but it should allow us to lazily fill in sparse vectors and then drop them to the backend. At least in theory...

brynpickering commented 2 months ago

We're already effectively using dense arrays as far as Pyomo is concerned. What we gain from also representing those objects in NaN-filled arrays is the ability to apply operations across N dimensions (including broadcasting). We moved away from that full determinism in v0.6 to what we have now because it is very messy to ensure in a generalised way, especially when math components don't share exactly the same dimensions.
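
A toy illustration of that broadcasting benefit (made-up names and values, purely to show the mechanics):

import numpy as np
import xarray as xr

# Components rarely share exactly the same dimensions; xarray aligns and
# broadcasts them, with NaN marking the undefined combinations.
flow_cap = xr.DataArray(
    [[1.0, np.nan], [2.0, 3.0]],
    dims=("techs", "nodes"),
    coords={"techs": ["pv", "boiler"], "nodes": ["n1", "n2"]},
)
weight = xr.DataArray([0.5, 1.0], dims="timesteps", coords={"timesteps": ["t1", "t2"]})

expr = flow_cap * weight  # broadcast to dims (techs, nodes, timesteps)
print(expr.dims)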

irm-codebase commented 2 months ago

Yep, I understand that the issue is not in the backend; the increase in memory will only happen on our side... As for a generalised algorithm: would something like this work? I've used a similar approach for comparable problems.

import itertools

dims = ["techs", "nodes", "steps", "carriers", "foo", "bar", "perrito"]
all_unique_combinations_sorted = set()
for i in range(len(dims)):
    for group in itertools.combinations(dims, i + 1):
        all_unique_combinations_sorted.add(".".join(sorted(group)).lower())
all_unique_combinations_sorted.add("GLOBALS")  # no idea if this is even needed

This contains all possible combinations of our dimensions, always in a canonical order (127 + GLOBALS in this example). Create a dict keyed by these combinations, store a sparse xarray array under each key, and add flattened vector data to it.

Lookups are easy: just sort and join the dimensions you want. Similarly, the large number of keys does not matter too much. Sparse data has little memory impact by design, and you can easily erase empty combinations if you wish.
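
To make the lookup concrete, a hypothetical sketch (the grouped_arrays store and dim_key helper are made-up names):

def dim_key(dims: list[str]) -> str:
    """Canonical key for a set of dimensions: sorted, joined, lower-cased."""
    return ".".join(sorted(dims)).lower()

# Hypothetical store: one (sparse) array per unique dimension combination.
grouped_arrays = {}  # e.g. {"nodes.techs": <sparse DataArray>, ...}

# Lookup: just sort and join the dims you care about.
key = dim_key(["techs", "nodes"])  # -> "nodes.techs"
array_for_techs_nodes = grouped_arrays.get(key)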

Would this work, or am I saying something silly?

brynpickering commented 1 week ago

I'm closing this as not an issue that we plan to address, given how small our datasets are (even with all their NaNs) compared to peak memory consumption in the optimisation step (see this comment).

irm-codebase commented 5 days ago

@brynpickering I'm OK with closing this.

Some closing remarks: my worry actually relates to the size of the optimisation problem. If we have a NaN parameter in the input data, there are two things that could happen:

However, I should have done more checks on this, so I will only re-open it if I identify it as a problem that reaches the actual optimisation.