Performance and add-API: "Vectorized" form?

Huite commented 8 months ago

Maybe somewhat of a random thought. The add-api looks very good to me, and will help enormously in setting up models. One downside is that it calls pd.concat() once per added element. I'm pretty sure that concat allocates an entirely new dataframe. My mental model for dataframes is that it's essentially a dict[str, numpy.array]-- this isn't quite accurate because pandas has a block manager and whatnot, but it's close enough.

Anyway, a numpy array cannot be resized, and it will be reallocated and so it means adding many elements will have quadratic time complexity rather than linear. If you want to add 10_000 nodes at once, this is obviously wasteful.

Bad solutions:

Just build the dataframe yourself for N elements, then call pd.concat yourself. This obviously works fine already, but still puts all the effort at the user.
Do something "delayed". You could try to be fancy with dask dataframes, or you could maintain an internal list of rows "to add". This obviously complicates the internals way too much, because you never know whether your input is actually a dataframe proper, or a dataframe-in-the-making.

Good solution:

Add methods to add many elements at once. Create one dataframe with many rows, and call concat once. It provides the same convenience, except you provide it list[NodeData] for example. (Add a type check for the single versus multiple methods.)
In edge construction, geometry construction can also be vectorized with the shapely.linestrings() constructor.
Something to consider: the data to add may already exist in dataframe form. That should be easily doable, since you can easily convert a dataframe to a list of named tuples.

# where dataframe is some earlier existing dataframe, e.g. from a different model.
basin_data = [
     [basin.Profile(**rows.to_dict()) for _, rows in dataframe.groupby("node_id")],
     ...  # abbreviated
]
model.basin.add_many(node_ids, geometry, basin_data)

Hofer-Julian commented 8 months ago

you could maintain an internal list of rows "to add". This obviously complicates the internals way too much, because you never know whether your input is actually a dataframe proper, or a dataframe-in-the-making.

Yeah, I suggested something similar to @visr the other day. We for sure need to benchmark a lot before looking into this. I don't think it would complicate internals too much, but we would lose the ability to set dataframes directly.

Hofer-Julian commented 8 months ago

Add methods to add many elements at once. Create one dataframe with many rows, and call concat once. It provides the same convenience, except you provide it list[NodeData] for example. (Add a type check for the single versus multiple methods.)

Sounds good to me. We should make clear that the main reason to use that API is performance.

Deltares / Ribasim

Performance and add-API: "Vectorized" form? #1245