Deltares / Ribasim

Water resources modeling
https://ribasim.org/
MIT License

Design Python API #588

Closed. evetion closed this 8 months ago.

evetion commented 1 year ago

The problem is that there's no clear design; it's a mix of different abstractions. Pandera DataFrames are high-level abstract things, but they represent low-level tables (which can come from Arrow or GeoPackage). There's a high-level abstraction of a nodetype in between, but it doesn't represent anything low-level. The current Model is a mix of a representation of these nodetypes with a (partial) low-level config. I will make a separate issue about this, which needs refinement.

Originally posted by @evetion in https://github.com/Deltares/Ribasim/issues/580#issuecomment-1715360255

evetion commented 1 year ago

Some thoughts:

Requirements

Ideally we have a clean and extensible low-level design (representing config files/tables on disk) that makes it easy to build a user-facing higher-level design, which abstracts away things like node ids and the separate node tables. It should also be possible to start a model from scratch and build it up iteratively, adding nodes one by one. This would also make some tests a bit easier, as one could do BasinStatic(id=1, foo=bar) and add it (see the sketch below), instead of assembling numpy vectors.
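
As a rough sketch of what that could feel like (Model, BasinStatic, and the field names here are illustrative stand-ins, not the actual API):

from pydantic import BaseModel

# Minimal stand-ins for the incremental API sketched above (hypothetical).
class BasinStatic(BaseModel):
    id: int
    drainage: float = 0.0

class Model(BaseModel):
    nodes: list[BasinStatic] = []

    def add(self, node: BasinStatic) -> None:
        # Validation already happened on construction of the row object.
        self.nodes.append(node)

model = Model()
model.add(BasinStatic(id=1))  # one node at a time, no numpy vectors needed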

Current issues

So the Config file is quite easy to represent, as it is autogenerated and quite straightforward. The issue is with representing tables, as a single GeoPackage class doesn't work (they can also be Arrow files). The nodetype abstraction doesn't exist in the files; we essentially have a loose collection of database tables, but these are not represented as such (only as DataFrames). sort is now defined on a thing like Basin, with multiple sort statements, one per DataFrame. Everything else is done in the Model class, often reaching down to check things on its children (https://github.com/Deltares/Ribasim/blob/main/python/ribasim/ribasim/model.py#L360-L363) instead of letting the children be responsible. It basically is not SOLID.

Low-level design

We might try to go with the hydrolib design (by me 😅; see https://deltares.github.io/HYDROLIB-core/0.5.2/topics/principles/). It essentially represents each file on disk as a (Pydantic-based) class, and these classes link to one another via the attribute that normally holds a filename. So to access the static basin table, one would use Config().Basin.static (normally pointing to a filename).
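
A simplified sketch of that pattern (pydantic-based; class names here are placeholders, and the real hydrolib FileModel machinery does much more):

from pathlib import Path
from typing import Optional
from pydantic import BaseModel

# Each class mirrors a file; the attribute that is a filename on disk
# holds the parsed object in memory.
class TableFile(BaseModel):
    filepath: Optional[Path] = None

class Basin(BaseModel):
    static: Optional[TableFile] = None

class Config(BaseModel):
    basin: Basin = Basin()

config = Config(basin=Basin(static=TableFile(filepath=Path("basin_static.arrow"))))
config.basin.static  # in the toml this attribute is just a filename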

I'm not sure how the above design can use autogeneration, because these classes require some customization.

Hofer-Julian commented 1 year ago

tables, but these are not represented as such but as DataFrames

Can you please explain what the difference is between a table and a DataFrame in that context?

visr commented 1 year ago

Thanks for the writeup. One thing to add to the requirements is that it should easily scale to 1e4-1e5 nodes. I never really profiled, but I've had to wait 10 s to write a 1000-node model, which IMO is too much; that should be <1 s.

Linking some issues that may be relevant:

evetion commented 1 year ago

tables, but these are not represented as such but as DataFrames

Can you please explain what the difference is between a table and a DataFrame in that context?

Ah yeah, I was unclear. I meant that these are file-based database (or Arrow) tables (I edited the above comment to reflect that), not a generic in-memory table format for data analysis.

While they are similar, I like to separate the concerns (each class with a small interface). So a file-based table class ideally only has a link to its source file and the related read/write methods, maybe with a parent class holding methods related to unique ids. The interface of a DataFrame is huge, and apart from some I/O methods it doesn't give us what we need. Ideally we wrap them, or allow them as input/output.
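
For example, something along these lines (hypothetical names; assumes pandas with pyarrow installed):

from pathlib import Path
import pandas as pd

# A thin file-based table: only a link to its source file plus read/write,
# with the DataFrame as input/output rather than as the interface itself.
class FileTable:
    def __init__(self, path: Path):
        self.path = path

    def read(self) -> pd.DataFrame:
        return pd.read_feather(self.path)

    def write(self, df: pd.DataFrame) -> None:
        df.to_feather(self.path)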

evetion commented 1 year ago

Thanks for the writeup. One thing to add to the requirements is that it should easily scale to 1e4-1e5 nodes. I never really profiled, but I've had to wait 10 s to write a 1000-node model, which IMO is too much; that should be <1 s.

Linking some issues that may be relevant:

* [ ]  [Start supporting pydantic v2 #382](https://github.com/Deltares/Ribasim/issues/382) not sure if needed, but let's design for the future (v2)

Pydantic V2 will help with that, as it is much faster (the core was rewritten in Rust). Then again, I'm not sure we can improve your 1e5-node write if it depends on GeoPackage writing. And I think we also have the use case of a few million rows for timeseries, which could be way slower to read/write?

visr commented 1 year ago

Yes, the 1 second target is for a normal-sized (~1000 node) model, without very long timeseries.

For timeseries with millions of rows we should recommend Arrow instead of GeoPackage anyway. So if performance is bottlenecked on the I/O performance of C/C++ libraries, that would probably be a good thing, because it would mean we are not spending time converting each row back and forth to a dict first, as mentioned in https://github.com/Deltares/Ribasim/pull/178#issuecomment-1517540061.
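
An illustrative sketch of the Arrow path (assumes pandas with pyarrow installed; file name made up):

import pandas as pd

# A large timeseries table written as Arrow/Feather: pyarrow writes the
# columns in bulk from C++, with no per-row conversion to dicts in Python.
df = pd.DataFrame({"node_id": [1] * 1_000_000, "flow_rate": [0.0] * 1_000_000})
df.to_feather("basin_time.arrow")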

Hofer-Julian commented 1 year ago

Discussion with @evetion and @visr:

Requirements:

visr commented 1 year ago

I profiled a script reading and writing a big model:

import ribasim
model = ribasim.Model.from_toml("data/nl/nld/nl.toml")
model.write("data/nl/nld-test")

Read and write each take almost 30 seconds when not running the profiler.
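(For reference, the exact invocation isn't recorded here, but a profile like the one below can be captured with cProfile and then opened in snakeviz, along these lines:)

import cProfile

# Writes profiling data to tmp.prof; inspect afterwards with "snakeviz tmp.prof".
cProfile.run(
    'import ribasim; ribasim.Model.from_toml("data/nl/nld/nl.toml").write("data/nl/nld-test")',
    "tmp.prof",
)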

The profile data is attached as profile.zip, and a screenshot of snakeviz tmp.prof is shown below:

[snakeviz screenshot of tmp.prof]

This shows that almost all time is spent in validation. For reading that was expected, but for writing it was not. Based on the profile, it is triggered by calling sort on the DataFrame before writing. It would be nice if validation could be done faster.

evetion commented 12 months ago

This shows that almost all time is spent in validation. For reading that was expected, but for writing it was not. Based on the profile, it is triggered by calling sort on the DataFrame before writing. It would be nice if validation could be done faster.

The slow write (sorting triggering validation) was fixed by #597.

evetion commented 12 months ago

I got positive feedback on the use of Pydantic and the overall design of Hydrolib; performance has not been an issue there, except for parsing/serializing very large ASCII files, which is unrelated to the architecture.

Therefore, I propose to roughly follow the same structure, with an API that closely follows the file formats (especially the config file). The difficulty is in having auto-generated code while customizing the classes extensively, and in handling topology/id relationships between otherwise ideally independent tables.

Implementation

handle: ContextVar[DataBaseHandle] = ContextVar("handle")

class Model(FileModel):  # mirroring all attributes of config.toml
    output: Output
    basin: Basin
    # other nodetypes
    solver: Solver
    logging: Logging

    # Unique id counter
    _id: int

    # Hidden Node and Edge classes
    _node: Table[Node]
    _edge: Table[Edge]

    def __init__(self, fn=None):
        """Initialize empty or from file."""
        # Get/set the handle for recurring database operations;
        # this prevents opening/closing the db many times.
        handle.set(DataBaseHandle(fn))

    def write(self, fn):
        """Write to file (.toml) / folder (all?)."""
        pass

    def run(self, fn):
        """Run the model."""
        pass

class Basin(NodeModel):
    time: Table
    # all other tables(!)

    def add_time(self, row: PydanticModel, geometry=None):
        pass

# Could also be an ArrowTable, changing read/write
class Table:
    rowtype: SomeSchema
    df: DataFrame[SomeSchema]
    _io: Union[Arrow, GeoPackage]  # or inherited from parent?

    def __init__(self, fn=None):
        """Initialize empty or from file."""
        handle.get()

    def write(self, fn):
        self._io.write(self)

Usage

Note how everything follows from the config file, assuming end users normally edit it.

# from scratch
model = Model("test", starttime=foo, endtime=bar)
# or from file
model = Model("config.toml")

model.basin.time.df # Gets you a DataFrame, following the config "path" to get there

# Time is actually a bit nasty, given it has multiple rows for the same id.
bt = BasinTime(time=now(), level=1)  # individual row construction
nbt = model.basin.add_time(bt, geometry=polygons)  # adds a row, triggers validation/node ids/etc
nbt.connect(other_node, geometry=linestrings)  # make an edge?

model.write(filename)  # writes all files

Hofer-Julian commented 12 months ago

Will this design still use the pydantic backend of pandera considering its inefficient way of validating row-wise?

evetion commented 12 months ago

Will this design still use the pydantic backend of pandera considering its inefficient way of validating row-wise?

I'd say that's an implementation detail; it should validate, and ideally be fast, but that doesn't change the API. To fix the performance, we probably need to transpile into a pandera schema. I'd still like to keep using both pandera and pydantic in the package.
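
A hand-written sketch of that pair (field names illustrative; uses pydantic v2 and a recent pandera):

import pandas as pd
import pandera as pa
from pandera.typing import Series
from pydantic import BaseModel

# Row-wise view: one object per record, validated by pydantic on construction.
class BasinTimeRow(BaseModel):
    node_id: int
    level: float

# Column-wise view: an equivalent pandera schema, validating whole columns
# at once instead of row by row.
class BasinTimeSchema(pa.DataFrameModel):
    node_id: Series[int]
    level: Series[float]

df = pd.DataFrame([BasinTimeRow(node_id=1, level=1.0).model_dump()])
BasinTimeSchema.validate(df)  # vectorized validation of the full table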

Hofer-Julian commented 12 months ago

Agree on all three statements.

evetion commented 12 months ago

Could use a comment on how to work object/row-based with things like time/profile. Those are individual rows with repeated node ids. Do we need a new grouped-row object? Like BasinTimes(time=[now(), now()], level=[1, 2])?
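
A sketch of what such a grouped-row object could look like (pydantic v2 syntax, field names taken from the example above):

from datetime import datetime
from pydantic import BaseModel, model_validator

# List-valued fields that expand to one table row per entry for a node id.
class BasinTimes(BaseModel):
    time: list[datetime]
    level: list[float]

    @model_validator(mode="after")
    def equal_lengths(self):
        if len(self.time) != len(self.level):
            raise ValueError("all columns need the same number of rows")
        return self

bts = BasinTimes(time=[datetime.now(), datetime.now()], level=[1.0, 2.0])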

evetion commented 11 months ago

Requirements for this package:

@visr Can you provide input on the prevalence of these use cases?

evetion commented 11 months ago

I will rewrite the example at https://deltares.github.io/Ribasim/python/examples.html to the new API for final comments, before starting to write the Python code.