Deltares / Ribasim

Water resources modeling
https://ribasim.org/
MIT License
42 stars 5 forks source link

Performance and add-API: "Vectorized" form? #1245

Open Huite opened 8 months ago

Huite commented 8 months ago

Maybe somewhat of a random thought. The add-api looks very good to me, and will help enormously in setting up models. One downside is that it calls pd.concat() once per added element. I'm pretty sure that concat allocates an entirely new dataframe. My mental model for dataframes is that it's essentially a dict[str, numpy.array]-- this isn't quite accurate because pandas has a block manager and whatnot, but it's close enough.

Anyway, a numpy array cannot be resized, and it will be reallocated and so it means adding many elements will have quadratic time complexity rather than linear. If you want to add 10_000 nodes at once, this is obviously wasteful.

Bad solutions:

Good solution:

# where dataframe is some earlier existing dataframe, e.g. from a different model.
basin_data = [
     [basin.Profile(**rows.to_dict()) for _, rows in dataframe.groupby("node_id")],
     ...  # abbreviated
]
model.basin.add_many(node_ids, geometry, basin_data)
Hofer-Julian commented 8 months ago

you could maintain an internal list of rows "to add". This obviously complicates the internals way too much, because you never know whether your input is actually a dataframe proper, or a dataframe-in-the-making.

Yeah, I suggested something similar to @visr the other day. We for sure need to benchmark a lot before looking into this. I don't think it would complicate internals too much, but we would lose the ability to set dataframes directly.

Hofer-Julian commented 8 months ago

Add methods to add many elements at once. Create one dataframe with many rows, and call concat once. It provides the same convenience, except you provide it list[NodeData] for example. (Add a type check for the single versus multiple methods.)

Sounds good to me. We should make clear that the main reason to use that API is performance.