Deltares / Ribasim

Water resources modeling
https://ribasim.org/
MIT License

Setting the table #54

Closed visr closed 1 year ago

visr commented 1 year ago

TODO

Last week @Huite and I talked about restructuring our tables; I'll try to summarize our ideas so others can comment.

Forcing

Right now the forcing table is in long format, with 4 columns: time, id, variable, value. It needs to be sorted by time. It's nice to have just one table, but it is also a bit odd to have all the variables spliced together like that. A user working on a model is probably thinking from the perspective of a variable that they want to update or visualize. To get direct access to a variable, it needs to be its own column. Another downside of putting variables together is that you might want to mix different types, like Int, Float32 and Float64. Although Arrow supports unions, having a column per variable solves that as well.

To give an example, we would transform this current long layout:

time                 id  variable  value
2019-01-01T00:00:00  6   P         2.89352e-10
2019-01-01T00:00:00  7   P         2.89352e-10
2020-12-31T00:00:00  6   E_pot     1.15741e-9
2020-12-31T00:00:00  7   E_pot     1.15741e-9

To this wide layout:

time                 id  P            E_pot
2019-01-01T00:00:00  6   2.89352e-10  1.15741e-9
2019-01-01T00:00:00  7   2.89352e-10  1.15741e-9
2019-01-02T00:00:00  6   2.89352e-10  1.15741e-9
2019-01-02T00:00:00  7   2.89352e-10  1.15741e-9
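
As an illustration, here is a small sketch of that long-to-wide pivot using DataFrames.jl (tooling on the side, not part of the model); unstack is its standard pivot function:

using DataFrames, Dates

long = DataFrame(
    time = DateTime.(["2019-01-01", "2019-01-01", "2020-12-31", "2020-12-31"]),
    id = [6, 7, 6, 7],
    variable = ["P", "P", "E_pot", "E_pot"],
    value = [2.89352e-10, 2.89352e-10, 1.15741e-9, 1.15741e-9],
)

# one column per variable, keyed on (time, id)
wide = unstack(long, [:time, :id], :variable, :value)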

However, some variables get updated daily, and others yearly or irregularly. Having one shared time column limits flexibility here. To address that, variables can be grouped together in a table if they share the same time column. So a variable like target_level can be separate from the daily meteo forcing table, as shown below. It would also work to keep it together in the daily table with missing data, if that is more convenient for a particular use case.

time                 id  target_level
2019-01-01T00:00:00  6   1.1
2019-06-01T00:00:00  6   1.3
2020-01-01T00:00:00  6   1.1
2020-06-01T00:00:00  6   1.4

Note on column names: in this setup the data column names like P must be unique for a variable within a particular node type. However, it might make sense for different node types to have a precipitation variable, and sharing a common namespace is not ideal. So either the column names should include both node type and variable, or tables should be separated per node type. The ID can be used to look up the node type as well, but for processing purposes it would be nice to have that readily apparent.

State, static and water balance

These can all be easily converted to wide format like the forcing table, but without a time column.

Profile

Right now profiles contain volume, area, discharge and level. We should split this into two tables: volume, area and level describe the elevation and vertically integrated geometry of the basins, while level and discharge together describe the rating curve of flow over a weir.

visr commented 1 year ago

After discussing this with the group, below is the approach I think we agreed on. I added a few details that we didn't discuss, so please read and comment.

GeoPackage

By default all model input data is together in one model.gpkg. A GeoPackage is a SQLite database, with good editing support in QGIS. The node.arrow and edge.arrow become spatial tables in the GeoPackage. The state.arrow, static.arrow and profile.arrow become non-spatial tables in the same GeoPackage file. They will be split into separate tables by node type, such that a schema can be enforced per node type. State is currently only kept by the LSW node, but in the future more node types can have state. Profiles will be split into profiles and rating curves as described in the post above.

The forcing.arrow is also split by node type, and is stored as a non-spatial temporal table that should be sorted on time.

Parquet

Sometimes it can be slow or inconvenient to have tables in a GeoPackage. Therefore we support storing all tables, except node and edge (which define the network), outside the GeoPackage. For this we support the Parquet file format. For instance, forcing data may get so large that GeoPackage is no longer a suitable choice. If the forcing is put in a Parquet dataset we benefit, for instance, from fast columnar compression. I use the term Parquet dataset here instead of Parquet file, since a table can be partitioned into multiple files: for instance each year or each ID could be stored in a separate file. As another example, storing the state in a Parquet file will make it easier to swap out different initial conditions, or to store only the state in a Delft-FEWS setup.

We also support having the tables for some node types in the GeoPackage and others in Parquet files. So you can have a "forcing/lsw.parquet" and have forcing for other node types in the GeoPackage. However, each table needs to be completely in one or the other. If we use a Parquet file for a table, the same table is still allowed to be present in the GeoPackage, but it will be ignored.

The waterbalance.arrow will also become one Parquet file per node type. There should be another optional output for the state: outstate.parquet.

As an example of the forcing tables, here are two: one for the LSW node type, and one for LevelControl.

LSW

time        id  p         et
2020-01-01  1   0.684735  0.050485
2020-01-01  2   0.693355  0.577703
2020-01-02  1   0.976873
2020-01-02  2   0.763088
2020-01-03  1   0.613714  0.663129
2020-01-03  2   0.20232   0.599062

LevelControl

time        id  target_level
2020-01-01  3   0.522607
2020-01-01  4   0.038752
2020-01-01  5   0.330723
2020-06-01  3   0.621447
2020-06-01  4   0.697511
2020-06-01  5   0.341952

The LSW node type has more variables and updates more frequently. Node types that are fully static need not have a forcing table at all. If only a subset of the variables is updated at a given time, the others are left missing; in this example et is only updated every second day. The id matches the ID in the node table, and is unique over all node types. Before loading the forcing into the model, the id needs to be converted to an index into the corresponding parameter vector.
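
A minimal sketch of that id-to-index conversion; the Dict-based lookup and the names below are illustrative, not the actual Ribasim.jl code:

# ids as they appear in the node table, in the order of the parameter vectors
node_ids = [1, 2, 3, 4, 5]
id_to_index = Dict(id => i for (i, id) in enumerate(node_ids))

# a forcing record (time, id, value) is applied via the looked-up index
p = zeros(length(node_ids))  # e.g. the precipitation parameter vector
function apply_forcing!(p, id, value)
    p[id_to_index[id]] = value
    return p
end

apply_forcing!(p, 3, 2.89352e-10)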

TOML

Retain config.toml, with the cases.toml from #47. The config.toml holds links to all data. The path to the GeoPackage needs to be added, for instance with database = "model.gpkg". The table names in the GeoPackage will be hardcoded. If a table is not in the GeoPackage but in a Parquet dataset, this needs to be indicated, remembering that the tables are split by node type:

[LSW]
state = "state/lsw.arrow"
forcing = "forcing/lsw.arrow"

If a directory is given, this is assumed to be a multi-file dataset. Inside the GeoPackage the tables have to be named like state_LSW and forcing_LSW.
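
A rough sketch of how the model could resolve each table to either an external file or the hardcoded name inside the GeoPackage, assuming Julia's TOML standard library (the function and fallback logic are illustrative only):

using TOML

config = TOML.parsefile("config.toml")
database = config["database"]  # e.g. "model.gpkg"

# An external path from the TOML wins; otherwise fall back to the
# hardcoded table name inside the GeoPackage, e.g. "forcing_LSW".
function table_source(config, nodetype, kind)
    path = get(get(config, nodetype, Dict()), kind, nothing)
    return path === nothing ? "$(kind)_$(nodetype)" : path
end

table_source(config, "LSW", "forcing")  # "forcing/lsw.arrow" here, "forcing_LSW" if absent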

evetion commented 1 year ago

Thanks for writing this! I like the overall direction. Some comments, mainly dealing with keeping things simple (KISS) and consistent.

Therefore we support storing all tables, except node and edge (which define the network), outside the GeoPackage.

Why make the exception here and introduce more code complexity? With the toml, we should be able to override any table.

The waterbalance.arrow will also become one Parquet file per node type. There should be another optional output for the state: outstate.parquet.

We never discussed the output explicitly, but doesn't it make sense to write the output into a single GeoPackage (with the optional outstate table), just like the input?

If a directory is given, this is assumed to be a multi-file dataset.

There could just be a single forcing_lsw.parquet somewhere, so we don't have to deal with multi-file datasets? Since we already split per node type and moved to Parquet, I don't expect any performance issues related to huge tables, just more data/code complexity/checks. We could always add it later as an extra (non-breaking) feature when it turns out we need it.

visr commented 1 year ago

Just discussed with @evetion.

Regarding the node and edge exception, the main reason is that these are the only two spatial tables, and spatial is special. Also most validation will be against these tables, and that might be easier to do in the same database.

For the output files, we can stick to Parquet-only output for now: since output is write-only, editing doesn't have to be supported. Also the outputs can get big, so Parquet would be a better match.

We allow multi-file datasets, but should perhaps be extra clear in logging since more can go wrong when parsing directory listings. Since Parquet datasets can generally be in either single-file or multi-file form, it is probably better to support both. Both are read with the same function, Parquet2.Dataset.
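
For illustration, a minimal sketch of that read path with Parquet2.jl and DataFrames.jl (the path is a placeholder):

using Parquet2, DataFrames

# the same constructor accepts a single Parquet file or a directory of files
ds = Parquet2.Dataset("forcing/lsw.parquet")
df = DataFrame(ds; copycols=false)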

visr commented 1 year ago

Thinking some more about the last point, multi-file Parquet dataset support, @evetion is right that we probably don't really need it. So let's build it for single files only, and when that is all working well we can do some tests to see whether multi-file support works well with that.

visr commented 1 year ago

We only briefly discussed Parquet vs Arrow. After playing around with both and reading a bit more, I think that we should stick to Arrow alongside the GeoPackage.

A reason for Parquet was that it is often mentioned as a logical on-disk companion to the Arrow in-memory format, good for archival purposes, as mentioned by Wes McKinney. The Arrow docs on versioning call major format changes an exceptional event if they were ever to happen, and the FAQ states:

The Arrow columnar format and protocol is considered stable, and we intend to make only backwards-compatible changes, such as additional data types. It is used by many applications already, and you can trust that compatibility will not be broken.

In this Ursa Labs blog post there are benchmarks comparing Arrow to Parquet. Arrow is generally faster to read and write, at the cost of higher file size.

When I compare Snappy-compressed Parquet LHM forcing data to lz4-compressed Arrow files, the Arrow file is only about 10% larger. With zstd compression at the default level, it is 30% smaller than the Parquet file.
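
A minimal sketch of how such a size comparison can be reproduced, assuming Arrow.jl, Parquet2.jl and DataFrames.jl (the table is a random placeholder, not the LHM forcing data):

using Arrow, Parquet2, DataFrames

n = 100_000
forcing = DataFrame(id = rand(1:100, n), P = rand(n), E_pot = rand(n))

Arrow.write("forcing_lz4.arrow", forcing; compress = :lz4)
Arrow.write("forcing_zstd.arrow", forcing; compress = :zstd)
Parquet2.writefile("forcing.parquet", forcing; compression_codec = :snappy)

# compare the on-disk sizes
filesize("forcing_zstd.arrow") / filesize("forcing.parquet")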

If I load the forcing table into a struct similarly to how we do that in the Ribasim.jl callback, Arrow is faster than Parquet by about 50%. Both are probably a minor cost compared to the computations.

Reading the Parquet2.jl dev notes and FAQ, the author clearly suggests using Arrow for our use case:

If your format is tabular, you probably want to use Arrow.jl. The reason is that the arrow format is much closer to a natural in-memory format and as such will be much less computationally expensive and require far fewer extra allocations than parquet in most cases.

One last reason is that for Arrow we already have Legolas.jl validation built in, and generally there seems to be more Julia community interest in using Arrow.
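
For context, a rough sketch of what a Legolas.jl schema declaration and validation could look like here, assuming its @schema/@version macros (the schema name and columns are illustrative):

using Legolas, Tables, Arrow
using Legolas: @schema, @version

# declare the required columns for the LSW state table
@schema "ribasim.state" State
@version StateV1 begin
    id::Int
    S::Float64
    C::Float64
end

# throws if the table does not provide the declared columns/types
table = Arrow.Table("state/lsw.arrow")
Legolas.validate(Tables.schema(table), Legolas.SchemaVersion("ribasim.state", 1))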

evetion commented 1 year ago

Agreed on staying with Arrow.jl for now. But some comments:

If I load the forcing table into a struct similarly to how we do that in the Ribasim.jl callback, Arrow is faster than Parquet by about 50%. Both are probably a minor cost compared to the computations.

So what's the absolute performance difference? Because I expect it to be negligible.

One last reason is that for Arrow we already have Legolas.jl validation built in.

But it operates on a generic Tables.schema, so it would work with Parquet2 just fine.

visr commented 1 year ago

So what's the absolute performance difference? Because I expect it to be negligible.

For the test I ran it was 1.4 vs 0.7 seconds or something small like that.

visr commented 1 year ago

A small addendum after discussing with @Hofer-Julian: let's not allow any extra columns other than the ones we require, at least until there is a clear use case for them. For filling the GeoPackage, it makes sense to have tooling that creates the known schemas before adding the data. This will help with some of the validation as well. Here is some Julia code for creating some of the empty tables with a schema using SQLite.jl, written for but removed from #56.

using Dates: DateTime
using SQLite
using Tables

# open the GeoPackage (SQLite) database that the empty tables are added to
db = SQLite.DB("model.gpkg")

# create an empty table based on a schema
schema_state = Tables.Schema((:id, :S, :C), (Int, Float64, Float64))
SQLite.createtable!(db, "ribasim_state_LSW", schema_state)

schema_static_levelcontrol = Tables.Schema((:id, :target_volume), (Int, Float64))
SQLite.createtable!(db, "ribasim_static_LevelControl", schema_static_levelcontrol)
schema_static_bifurcation =
    Tables.Schema((:id, :fraction_1, :fraction_2), (Int, Float64, Float64))
SQLite.createtable!(db, "ribasim_static_Bifurcation", schema_static_bifurcation)

schema_lookup_lsw =
    Tables.Schema((:id, :volume, :area, :level), (Int, Float64, Float64, Float64))
SQLite.createtable!(db, "ribasim_lookup_LSW", schema_lookup_lsw)
schema_lookup_outflowtable =
    Tables.Schema((:id, :level, :discharge), (Int, Float64, Float64))
SQLite.createtable!(db, "ribasim_lookup_OutflowTable", schema_lookup_outflowtable)

schema_forcing_lsw = Tables.Schema(
    (
        :time,
        :id,
        :demand,
        :drainage,
        :E_pot,
        :infiltration,
        :P,
        :priority,
        :urban_runoff,
    ),
    (
        DateTime,
        Int,
        Union{Missing, Float64},
        Union{Missing, Float64},
        Union{Missing, Float64},
        Union{Missing, Float64},
        Union{Missing, Float64},
        Union{Missing, Float64},
        Union{Missing, Float64},
    ),
)
SQLite.createtable!(db, "ribasim_forcing_LSW", schema_forcing_lsw)