mhauru opened this issue 1 year ago
Pat's notes from our latest chat: https://docs.google.com/document/d/1o8vGaGX6xcMTGS6ndvYF47RfMEeq6Hp6bQqKB0k-KFw
One of my thoughts, with Pat's questions and my answers:
I was thinking more about this database design. Here's one idea of how it might work, adapting Pat's design, with four tables:

Table: Object (this one is both object and template in Pat's language)
- Columns: name, arbitrary JSON metadata
- Examples: the concept of temperature, the concept of temperature sensor, temperature sensor #15, the concept of location, location at coordinates (1, 5), task to do a thing by Friday

Table: Relation type
- Columns: name, JSON metadata including a schema for metadata that relations of this type must have
- Examples: "is instance of", "sensor is at location", "sensor type records measure"

Table: Relation
- Columns: relation type id, object 1 id, object 2 id, JSON metadata that conforms to the schema of the type
- Examples: "temperature sensor #15 is an instance of the concept of temperature sensor", "temperature sensor records temperature measurements", "sensor #15 is at location #23"

Table: Data
- Columns: object id, timestamp, JSON data. Maybe additional columns for raw floats, ints and bools if it helps (performance?), otherwise all data in the JSON column.
- Examples: "reading from sensor #15 at datetime blahblah is 15.2"

Why did I fold both objects and templates into a single table? Because once I started thinking of relations I quickly realised I want relations between templates too. I think the object, relation type, and relation tables together form an ontology or a knowledge graph or a semantic web (I'm unclear on the terminology here). A known, common data structure.

Why not fold the data into the object table as well? Because I suspect that would make queries very slow. The point of the data table is that the vast masses of data go in that one; the object table should probably only ever have 10-1000 rows, data can have millions.

I'm not quite sure why I want everything to have a JSON column for "metadata". It's kinda a contingency plan, because you can dump any extra stuff that doesn't fit nicely anywhere else there. Maybe they are not needed for e.g. relation or relation type.

Something that this doesn't do well yet is that I don't know how to associate several objects with a data point (how do I say that a reading is from sensor #15 and of measure "temperature"), nor data points with each other (maybe there are several values coming out of a single sensor reading, and they should be linked). You could always work around this by putting more stuff in the JSON field (you can put anything into JSON...), but that doesn't seem like a very elegant or efficient solution. I'm also considering a concept of data batches, either to be put in the objects table or in a separate table, but that doesn't seem infinitely elegant either.

On a related note, I can think of the following three criteria for any database design we consider:
1. Easy to write queries for.
2. Fast to execute queries.
3. Flexible enough to accommodate any needs that may arise.

Am I missing some fourth criterion, or is that it?
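To make the example rows above concrete, here is a purely illustrative sketch of how they might be inserted, assuming tables shaped roughly as described; all table names, column names, and ids here are placeholders rather than a settled design.

```sql
-- Hypothetical illustration only: "sensor #15 is at location #23", plus one reading.
INSERT INTO object (object_id, name, metadata) VALUES
    (15, 'temperature sensor #15', '{}'),
    (23, 'location at coordinates (1, 5)', '{}');
INSERT INTO relation_type (relation_type_id, name, metadata) VALUES
    (3, 'sensor is at location', '{"schema": {}}');
INSERT INTO relation (relation_type_id, object_1_id, object_2_id, metadata) VALUES
    (3, 15, 23, '{}');
INSERT INTO data (object_id, timestamp, value) VALUES
    (15, '2024-01-05 12:00:00', '{"temperature": 15.2}');
```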
I went through Markus' idea yesterday and I have some questions and proposals. Let's discuss them on Friday, but it would be good if you guys had a look through it if you have time.

Comment: I like that there is this 'ontology that emerges'. I like that this allows for relations between objects.

Questions:
- How would the look-up through the relation table look? Say I have object1 related to object2 (stored in the objects table). If I want to ask: given object1, what is the other object associated with it? Would I have to look through the 'relations' table to find a relation that has object1 in it, then see that it is also attached to object2? It might be better somehow to include the relation in the object table itself, so you could look up object1 and the relation to object2 is already given.
- How would constraints be applied? Previously 'templates' constrained 'objects'. E.g. metadata of 'DHT22' must be of the form '{temperature: float, humidity: float}', i.e. only 'data' with the name 'temperature' and 'humidity' can be associated with objects of type 'DHT22', etc. If objects can be related to objects, how would we go about specifying this constraint?
- Endpoints. If I have a hierarchy of objects like this: object1 -> object2 -> object3 -> data, how would the endpoint querying experience look? Would I have to query for object1, see it is related to object2, query object2, see it is related to object3, query object3 and find the data associated with it? Compare that to, say, another allowed hierarchy of objects: object4 -> data. The querying experience seems quite different, and making a standard, easy-to-use querying experience might be difficult.

Proposal: I'm not sure what the answers to these questions might look like in your proposal, but I have taken them into account in my new proposal. I have included the idea of relations and ontology from your idea. I also included a way to associate multiple objects with a single datapoint. I only have one 'relation_type' currently but this can be extended simply - let's discuss.
We can talk more on Friday, but I'll write here short answers to two of the questions.

How would the look-up through the relation table look? Say I have object1 related to object2 (stored in the objects table). If I want to ask: given object1, what is the other object associated with it? Would I have to look through the 'relations' table to find a relation that has object1 in it, then see that it is also attached to object2? It might be better somehow to include the relation in the object table itself, so you could look up object1 and the relation to object2 is already given.

Yeah, you would do a query with two joins: object table, joined with the relation table to find all the relations of object1, joined again with the object table to find object2 at the other end of that relation. I don't think this would be a problem, postgres can make these queries very fast, and this is the standard way of implementing a many-to-many relation between two tables (or in this case a table with itself). The problem with going directly from object1 to object2 in the object table is that you would either 1) only have many-to-1 and 1-to-1 relations, if each object row has a column for the object it is related to, or 2) not be able to implement these relations as foreign keys, if you have a column that lists multiple objects object1 is related to. The benefit of using foreign key constraints is that it makes it harder to insert invalid relations (e.g. referencing an object that doesn't exist) and makes many queries easier, because postgres is very well equipped to follow foreign key relations and join data over them, and can automatically do things like deleting all sensors related to a sensor type when you delete that type, etc.

How would constraints be applied? Previously 'templates' constrained 'objects'. E.g. metadata of 'DHT22' must be of the form '{temperature: float, humidity: float}', i.e. only 'data' with the name 'temperature' and 'humidity' can be associated with objects of type 'DHT22', etc. If objects can be related to objects, how would we go about specifying this constraint?

Relation types could be marked as either enforcing or not enforcing constraints. So for instance, you would define a relation type called "is a template for", and then the general concept of DHT22 sensor would be related to DHT22 sensor #23 by that relation. And then you would say that "is a template for" is a relation for which object2 (the instance) has to have its JSON metadata follow a structure specified in the JSON metadata of object1 (the template). These constraints would still have to be enforced in application level code (or maybe postgresql triggers), just like they would if template and object were separate tables. Note also that you could specify constraints of other kinds that don't relate to the metadata. For instance, here is a constraint you might imagine wanting: Any object X that has a relation of type 'is instance of' to the object 'concept of sensor' (i.e. X is a sensor) must also have a relation of type 'is located at' to some object Y, that in turn has a relation of type 'is instance of' to the object 'concept of location' (i.e. Y is a location). In other words, every sensor must have a location. Not sure how to write these sorts of constraints in the JSON metadata of one of the tables (not even sure which table I would try to put them in). It might be doable, but it might also be that we shouldn't try to specify constraints in JSON, but somewhere outside the database.
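Going back to the first question, here's a rough sketch (untested, with column names assumed rather than settled) of what the two-join lookup could look like:

```sql
-- Find every object related to object1, going object -> relation -> object.
SELECT o2.*
FROM object AS o1
JOIN relation AS r ON r.object_1_id = o1.object_id
JOIN object AS o2 ON r.object_2_id = o2.object_id
WHERE o1.name = 'object1';
```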
The question about endpoints is a big (and good) one, I'll have to think about this and we can talk on Friday. The question that most bothers me about endpoints: If we make a really general and flexible database design, then any user needs can be accommodated by it. No user will want to think in super abstract terms though, so then we have to design nice and understandable API end points that hide the abstract nature of the database design. But if we want to make the API end points nice and understandable, then they won't be as general and flexible any more. So then if someone wants to add new functionality, they would need to write new API end points, even if they wouldn't have to write new database code. But if, to add more functionality to the platform, you have to go write API code, then is that really much easier than also having to write database code? In which case, what have we gained by this extreme generality and flexibility of the database design?
I would like to start a list of scenarios that any database schema design should be able to serve. I'll list them in this comment. If you want to add new ones/make edits, leave a comment and I'll edit this message. That way we have a single place that lists all the scenarios.
The point of these scenarios is not to be exhaustive of the kinds of things we would like to do, nor to be fully realistic. The point is to stretch any proposed schema and see if it can accommodate different types of needs.
Ideally I would like to work out, for every schema proposal we are seriously considering and for every scenario, the answers to the following questions:
Optimally the answers to question 2 would be literal SQL queries. I've written down a lot of possible queries, so we probably don't want to write actual SQL for all of them, but I would like to spell out at least the trickiest ones very explicitly; it often exposes details I wouldn't think of otherwise.
Some unfinished sketches for scenarios, to be fleshed out in more detail.
Slides from today: https://docs.google.com/presentation/d/1bz3JLZMXxjPN6eCQBD3244kBYLuDeD5OHpVOhKk3_s8/edit?usp=sharing
Visualisation of a minimal example for explicit typing and implicit typing.
I wrote down what the data structures would look like for the first 5 scenarios in the minimalistic schema I propose above.
Tables, with their columns and column data types:
`object`:
- `object_id` (int)
- `name` (str)
- `metadata` (jsonb)

`relation_type`:
- `relation_type_id` (int)
- `name` (str)

`relation`:
- `object_1_id` (foreign key from `object.object_id`)
- `relation_type_id` (foreign key from `relation_type.relation_type_id`)
- `object_2_id` (foreign key from `object.object_id`)
- `metadata` (jsonb)

`data`:
- `datum_id` (int)
- `object_id` (foreign key to `object.object_id`)
- `timestamp` (timestamp)
- `value` (jsonb)

The first three tables, `object`, `relation_type`, and `relation`, form a directed graph. Objects are the nodes, relations are the edges, and relation types are edge labels. This can be seen as a knowledge graph. The `data` table holds the bulk of the data, all the raw readings. Each datum is associated with an object and a timestamp.
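For reference, the same minimal schema written out as PostgreSQL DDL; this is just a sketch, and the exact types and constraints (NOT NULLs, indexes, etc.) are still open.

```sql
CREATE TABLE object (
    object_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    metadata JSONB
);

CREATE TABLE relation_type (
    relation_type_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE relation (
    object_1_id INTEGER NOT NULL REFERENCES object (object_id),
    relation_type_id INTEGER NOT NULL REFERENCES relation_type (relation_type_id),
    object_2_id INTEGER NOT NULL REFERENCES object (object_id),
    metadata JSONB
);

CREATE TABLE data (
    datum_id SERIAL PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES object (object_id),
    timestamp TIMESTAMP NOT NULL,
    value JSONB
);
```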
Because the first three tables are nothing but a labeled, directed graph, with both nodes (objects) and edges (relations) able to carry arbitrary json data on them, below I will describe how data would be stored in those tables by literally drawing a graph.
In the drawings all blue boxes are rows in the `object` table. Black lines with arrows are rows in the `relation` table; the arrow points from `object_1_id` to `object_2_id`, and the text next to the arrow is the name of the relation type. The contents of the `relation_type` table are left implicit; they are just all the arrow labels that appear in the diagrams. Red boxes are rows in the `data` table, and the arrowless red lines lead to the object this datum belongs to (the `data.object_id` column). If either an object or a relation has JSON data in the `metadata` column, then I've written that below the name in curly braces {}.
See the comment above for descriptions of the scenarios. I didn't always capture the full scenario, just enough to show that I easily could, to keep the graphs simpler. After scenario 1 I got sick of typing out "is instance of", and I started abbreviating it as "i.i.o.".
Note how the moving of sensors from one location to another is captured in the "from" timestamps in the "located at" relations, stored in the `relation.metadata` JSON column.
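Purely as an illustration (the ids and the "from" key are made up, not part of the proposal), such a relation row might be inserted like this:

```sql
-- Hypothetical "located at" relation; the metadata records when the sensor
-- arrived at that location.
INSERT INTO relation (object_1_id, relation_type_id, object_2_id, metadata)
VALUES (15, 4, 23, '{"from": "2023-03-01T00:00:00"}');
```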
I captured only the GES model, because it's more complicated.
I capture here what the data would look like for a single batch. Note that this one doesn't use the `data` table at all, because the things to be captured require too much linkage between objects for that to work. Adding a new batch creates one new row in `object` and a handful of rows in `relation`, one for each event like sowing and harvesting.
Here the combination of type of warning and source of warning, e.g. "sensor 15 isn't reporting data", is an object, and every time such a warning is raised that creates a new row in `data` referencing that object.
This one is pretty trivial, but that may be because we haven't fleshed the scenario out much. There may be complications that we don't anticipate, since we've never worked with this sort of data.
Each row in the `data` table captures when something has changed. Here the lighting schedule was set once, so there's only one row in `data` referencing that. The lights have been toggled three times. (There's a typo in one of the timestamps, sorry.)
Note that this is still missing all the stuff about what querying these data structures would look like. I just didn't have time to do that yet.
Note also that these are not the only ways to fit these scenarios into this schema; they are just the best ones I could think of.
The one aspect of the schema that I most hesitated on was whether the `data` table should, rather than a single `object_id` column, have two columns called `source_object_id` and `data_type_object_id`, both of which would foreign-key reference the `object` table. This would avoid a couple of complications, e.g. I wouldn't need the "Temperature in Celsius recorded by TS1" object, or the "GES run # 15 temperature lower bound" object. In some other instances, most notably the operational parameters scenario, I wouldn't have a use for both columns though. Still a bit undecided on this, but the change is easy to make if we want to.
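For comparison, a sketch of what that alternative `data` table could look like, using the column names floated above (everything else is assumed):

```sql
CREATE TABLE data (
    datum_id SERIAL PRIMARY KEY,
    -- e.g. the sensor or model run the datum came from
    source_object_id INTEGER NOT NULL REFERENCES object (object_id),
    -- e.g. the quantity or measure the datum is a value of
    data_type_object_id INTEGER NOT NULL REFERENCES object (object_id),
    timestamp TIMESTAMP NOT NULL,
    value JSONB
);
```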
Given that this is just a directed labeled graph with some metadata and foreign key links to the `data` table, I looked a bit into graph databases. They are databases custom-built for storing graph data, distinct from relational databases that store tabular data.
Pros:
Cons:
- The `data` table would still need to reside in a SQL database, and would need to reference objects in the graph database.

The current market leader in graph databases seems to be Neo4j, which unfortunately is proprietary software. There are many graph query languages; I didn't understand the differences. There are also some extensions for PostgreSQL that do graph stuff, but the ones I looked at either seemed immature or didn't do what we need.
I'm currently not excited about using a graph database (see cons above), but we should keep in mind that they are an option, especially if our graph grows bigger than anticipated and performance becomes an issue. Looking at the above scenarios, I would expect at most a few hundred new nodes in the graph (rows in `object`) per day, maybe a bit more for a bigger farm with lots of batches and simulation runs. This would result in something like ~100k rows a year, and if we partition old data into separate tables, that sounds manageable to me.
Slides from today: https://docs.google.com/presentation/d/1D-qgTUPa-O0o1NYH_yPjNa_5qQokyBGqoCsS9_6Yqcc/edit?usp=sharing
Comparison of implicit and explicit typing for Scenario 1.
Overview + explicit typing/implicit typing. Not complete.
https://docs.google.com/document/d/13q6m0Mnn4_DO7da5SozNn3UB8Te2ODpVm479gd-vBPU/edit?usp=sharing
By the way, turns out there's a Turing interest group on knowledge graphs:
As proposed by @rmward61, we should add a
Like the usual time series data, but data might be
We might also want to access the old versions even after the fixes come in.
Should probably talk about this with Rothamstead folks, to make sure we capture all the possible things that might happen under this heading.
I just read @vijja-w's notes, below are some comments on that and on questions that came up in our meeting on Tuesday. I'm abbreviating implicit typing as IT and explicit typing as ET.
I don't understand how splitting into multiple relation tables would work in IT. Any distinction between e.g. what is an object type and what is a data type is in the data (in the relations it has), not in the database schema, so I wouldn't know which table to write a relation to.
IT, as proposed, would not have data-data or data-object relation tables. Those are the ones that might get really big. The only relation table in IT would be object-object, which would also cover what in ET would be object type - object type, object type - data type, data type - object type, and data type - data type (which I think should be tables to make use of foreign keys). The object-object table in IT would never get very large, so I don't think there would be any need to split it into subtables. Doing so would complicate queries, because you would have to hard code which relation table something gets written into.
We do not have ‘object_type’ or ‘data_type’ so doing things like ‘filtering by object_type’ is not natural under implicit typing.
I would disagree on the naturality claim. Here's how I imagine the queries would look for getting all sensors that are of type "temperature sensor". (I haven't actually tested if my SQL syntax here is valid, but it's close.)
In ET:
```sql
SELECT * FROM object
JOIN object_type ON object.type_id = object_type.id
WHERE object_type.name = 'temperature sensor'
```
In IT:
```sql
SELECT * FROM object o1
JOIN relation ON relation.object_1_id = o1.id
JOIN object o2 ON relation.object_2_id = o2.id
WHERE o2.name = 'temperature sensor'
AND relation.relation_type = 'is instance of'
```
That's a tiny bit more complicated in IT, but I wouldn't say it's unnatural or hard to follow. If you want to query instead for all sensors of any sensor type, the queries I think would look like this:
In IT:
```sql
SELECT * FROM object o1
JOIN relation r1 ON r1.object_1_id = o1.id
JOIN relation r2 ON r2.object_1_id = r1.object_2_id
JOIN object o2 ON r2.object_2_id = o2.id
WHERE o2.name = 'sensor'
AND r1.relation_type = 'is instance of'
AND r2.relation_type = 'is subtype of'
```
So that's saying "pick all objects that are instances of a type that is in turn a subtype of 'sensor'". The query is very similar in ET:
```sql
SELECT * FROM object
JOIN object_type ot1 ON object.type_id = ot1.id
JOIN object_type_object_type_relation ototr ON ototr.object_type_1_id = ot1.id
JOIN object_type ot2 ON ototr.object_type_2_id = ot2.id
WHERE ot2.name = 'sensor'
AND ototr.relation_type = 'is subtype of'
```
That's assuming we have an `object_type_object_type_relation` table. If we don't, and rather things like one type being a subtype of another are stored in a JSON column, then I don't think there's any way to do this in a single query.
We also have the added benefit of having all queries be faster (looking through smaller ‘object_type’ and ‘object tables’, as opposed to the much bigger ‘relations’ table)
I think the speed difference here would be negligible, because `relations` would still be something like 100-10,000 rows, which is quite trivial for PostgreSQL, especially since this table is very easy to index efficiently.
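As an illustration of that last point (index names arbitrary, untested):

```sql
-- Indexes like these would keep relation look-ups cheap even if the table
-- grows to tens of thousands of rows.
CREATE INDEX relation_object_1_idx ON relation (object_1_id);
CREATE INDEX relation_object_2_idx ON relation (object_2_id);
```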
- Forcing the existence of ‘is_instance_of’, is forcing the existence of an ‘object_type’ - ‘instance’ hierarchy.
- Why not just model this explicitly and not have to force anything?
This is a very good question, and I think it gets to the heart of the matter. If `object_type` - instance, aka "is instance of", was the only relation to model, I think we should just have them be separate tables. But I think we will also need many other kinds of relations, such as "is subtype of", "is located at", and "is instance of" where the instance itself is a type, which further has its own instances. Once we have a multitude of such relation types, the question to me turns the other way around: since we already need a way of keeping track of generic relations of all kinds, why take one of those relations, "is instance of", and treat it differently, when we could just as well use the same system for all relations? I'll comment on this more below in the pros/cons section.
How would you look up the constraints to use [in IT]?
Very much the same way as in ET: you look up the relevant type, which you can find in the `object` table, and the body of it holds the constraints that any instance of this type must fulfill. The only difference is which table you look into, and whether something being a type and something else being an instance of a type is marked by them being in separate tables with a foreign key link, or by them having a relation of "is instance of".
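A sketch (untested, using the minimal schema's column names) of what that look-up could be in IT:

```sql
-- Fetch the metadata of the type object for a given instance; the constraints
-- on instances would live in that metadata. The instance name is an example.
SELECT type_obj.metadata
FROM object AS inst
JOIN relation AS r ON r.object_1_id = inst.object_id
JOIN relation_type AS rt ON rt.relation_type_id = r.relation_type_id
JOIN object AS type_obj ON type_obj.object_id = r.object_2_id
WHERE inst.name = 'temperature sensor #15'
  AND rt.name = 'is instance of';
```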
- …`JOIN` in the SQL query. There may also be a performance benefit to this, though I suspect it will be very small, since all the object/type stuff is very small compared to the data table, which is where the bulk of the query will be.
- …`object_type` and `data_type` with each other and within themselves could be confusing. I can easily imagine looking for a relation in the wrong table. Some queries would also get more complex, because you would need to remember to check multiple tables.
- …`type`.

The last two points are the ones that weigh the most for me: since we aim for something very flexible, I worry that things will come up where the more rigid structure of object types, data types, and objects is not a natural fit. This could happen as we develop new features, or when a user tries to independently use DTBase for something we did not plan for. Whereas the very simple structure of everything being made of objects and relations between them is quite universal, and you can fit a lot of use cases into its mold without having to hackily bend things out of shape.
Thanks for this @mhauru, really appreciate you going through this in detail and giving such a detailed response. I have read through your response. Let's chat when I get back (although like I said I am happy to hop on calls here and there if required). Having said that here is my comment:
In the IT schema (as it currently is) there exists:
I think point 2 is not great, since we cannot have one datum associated with many objects; this would be useful for things like reservations: a reservation (data) being associated with a Slot (object) and a Batch (object).
This can be side-stepped using 'proxy objects', where you define an object called 'reservations-belonging-to-slot-1-and-batch-1' and then associate data with these objects.
This is, however, also problematic, as the number of these proxy objects scales multiplicatively.
A data-object relation table solves this without having to do hacky things like the above.
The existence of the object-data and data-data tables is an optional bonus which I think will help us cover edge cases, but that is beyond the scope of this point.
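To spell out the proxy-object workaround referred to above (all names purely hypothetical):

```sql
-- One proxy object per (slot, batch) combination, which is why the number of
-- proxy objects scales multiplicatively. Reservations would then be rows in
-- the data table referencing this proxy object.
INSERT INTO object (name, metadata)
VALUES ('reservations-belonging-to-slot-1-and-batch-1', '{}');
```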
You are absolutely right, I think the term ‘unnatural’ is overkill (sorry).
I’m not sure why we need JOINs (I’m not fluent with JOINs). Would the following not be sufficient?
In ET, given the object_type with
SELECT * FROM object WHERE object_type_id =
Very much the same way as in ET: You look up the relevant type, which you can find in the object table, and the body of it holds the constraints that any instance of this type must fulfill.
This is for constraints on the body, which I agree with and understand. A different kind of constraint I am unsure about is constraints on relations.
Two examples where this is relevant:
In ET the constraints would look like this:
(1)
(2)
Note it is very easy to write code to validate these constraints too: just loop through the ids in 'related_data' and 'related_object' and check that the insertion request has all of them.
What does the above procedure look like in IT?
I do not think we should have four tables for:
We should fold it up as dictionaries with foreign keys in them as suggested in the current version of ET.
We are only using these to check for constraints after all (see *** above)
Please note that ET also has relation types, and so far I have been able to use only one relation type to model all the scenarios. I think this shows how flexible the schema is (IT and ET): we don't actually need to define that many different relation types to be able to capture the data structure; most of the heavy lifting is done once we define the objects/object_types.
The only difference between ET and IT is that ET has two more tables: object_type and data_type. (All the four-relation-table stuff doesn't actually need to be there in either IT or ET; I just think it's a nice thing to have.)
Having said this, I think IT is actually a subset of ET. Consider the following: if we simply leave the object_type and data_type tables in ET empty, we would get IT.
ET just adds ‘typing’ tables which allow us to group objects and data in semantically meaningful ways, allowing for simple look ups, sorting and filtering. ET also has the added benefit that relational constraints are given to you immediately, based on how object_types and data_types are linked together.
The current IT and ET schemas as proposed have many differences, e.g. four relation tables vs one relation table + data-foreignkey-object. These are not fundamental to ET or IT; they are just what was proposed under each schema, and both can be applied to ET and IT.
The thing that is fundamentally different as a consequence of choosing ET or IT is that ET has two additional typing tables.
Hi @vijja-w, happy new year!
Thanks for the comments, here are some further thoughts on them:
I agree that proxy objects are a bit clumsy, and feel like they add complexity to the schema that could be avoided. With a `data-object` table my main worry is performance, since the queries related to the `data` table are the heavy ones. Note also that the `data-object` table would need to have a column for relation type, to make clear what role different objects serve for a data point, e.g. that one is the source and another is the quantity measured.
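For reference, a sketch of what such a data-object relation table might look like, including the relation-type column; all names here are assumptions.

```sql
CREATE TABLE data_object_relation (
    datum_id INTEGER NOT NULL REFERENCES data (datum_id),
    object_id INTEGER NOT NULL REFERENCES object (object_id),
    -- e.g. "source" vs "quantity measured"
    relation_type_id INTEGER NOT NULL REFERENCES relation_type (relation_type_id)
);
```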
I'll need to give this more thought, and maybe a test of performance would be in order as well. Neither option seems ideal, and I'm not sure what the best balance is.
I would avoid a `data-data` table. It could easily explode in size and kill performance, and I can't think of use cases for it that wouldn't equally well or better be served by `data-object` relations, with proxy objects if need be.
I’m not sure why we need JOINs (I’m not fluent with JOINs). Would the following not be sufficient?
In ET, given the object_type with
SELECT * FROM object WHERE object_type_id =
;
That works if you know the ID; I just wanted to cater for the case where you only know the object type name, which is what the user of the backend API would give us. You could first do one query to find the ID of the type with the given name, and then a second query to get the objects with this ID. With a JOIN we bake these two things into a single query.
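Spelled out, the two-query version would be something like this (the returned id is of course just illustrative):

```sql
-- First find the type's id by name...
SELECT id FROM object_type WHERE name = 'temperature sensor';
-- ...then, supposing that returned 7, fetch the objects of that type:
SELECT * FROM object WHERE type_id = 7;
```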
What does the above procedure look like in IT?
It would be the same except for how we look for the `related_data` and/or `related_object` stuff. You would run a query of "what are all the objects related to this one by a relation of `instance must relate to`" (or whatever we want to call the relation type). The query would look something like this:
```sql
SELECT * FROM object o1
JOIN relation r ON r.object_1_id = o1.id
JOIN object o2 ON r.object_2_id = o2.id
WHERE o2.name = 'tray'
AND r.relation_type = 'instance must relate to'
```
That query would return two objects, "slot" and "grow batch".
I neglected to put these sorts of relations between types in the diagrams I drew above; I may amend that if I find a moment.
…`object_type` and `data_type` in ET

We should fold it up as dictionaries with foreign keys in them as suggested in the current version of ET.
PostgreSQL can't enforce a foreign key relation on multiple values in the same column in a dictionary (or a list, or JSON). So your foreign keys wouldn't be enforced on the database level, and you also couldn't do JOINs over them.
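To illustrate the contrast (table and column names assumed): a separate relation table lets the database enforce the references, whereas ids embedded in JSON cannot carry such a constraint.

```sql
-- With a join table, PostgreSQL enforces that both ends of the relation exist:
CREATE TABLE object_type_relation (
    object_type_1_id INTEGER NOT NULL REFERENCES object_type (id),
    object_type_2_id INTEGER NOT NULL REFERENCES object_type (id)
);
-- With something like a JSONB column on object_type, e.g.
--     related_object_type_ids JSONB  -- '[3, 7, 12]'
-- there is no REFERENCES constraint, and joining over it needs explicit
-- JSONB operators.
```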
We are only using these to check for constraints after all (see *** above)
I don't think that's going to be the only use. Two others that come to mind now are getting various sorts of data, e.g. a query to get all sensors of any sensor type, which relies on the `is subtype of` relation, and documenting relations between objects, e.g. that some model output measure relates to the GES model and not to the Arima model.
Please note that ET also has relation types and so far I have been able to use only one relation type to model all the scenarios.
I think having descriptive names for relation types is good for understandability and documentation. In the diagrams I drew above, I ended up using around 20 different relation types: is instance of, of quantity, from sensor, records quantity, located at, located in section, is subtype of, output measure type for model, parameter type for model, run type for model, run for run output, measure for run output, run parameter, batch sown/planned/moved/harvested (4 relation types), recipe for, of warning type, from source, and schedule for.
It's not absolutely necessary to have that many, but I think it makes the data easier to understand. Use cases we haven't explicitly planned for would probably add their own relation types too.
Enumerating the design decisions to be made:
1) Will the data table have a) a single FK reference to the objects table, or b) two references to source object and data type object?
2) Will there be a many-to-many relation table between data and object tables, or just a foreign key column in data?
3) Will types be in a separate table from objects?
4) Will data types be in a separate table from objects and object types?
5) Will objects refer to their types using a foreign key column or through a many-to-many relation table?
6) Will constraints be in a JSON column or in a separate constraints table?
7) If object types and/or data types are in a separate table, will they reference each other using a many-to-many relation table or using JSON columns?
Note that while many of these seem related to each other, they are in fact mostly independent, with the exception of the last one.
@vijja-w did I forget something?
I put together the design choices using a 'full' schematic that models a lot of the concepts that we've discussed. We can then switch out components or remove them as required. Please have a look @mhauru https://docs.google.com/document/d/1VnbgMnvD_FQK6mb-_wVPW-13HlHIn7rJgIN8zK87FDc/edit?usp=sharing
Thanks Pat, I especially like this summary:
Consideration
- Implicit modelling of these concepts (proxy, tags, etc.) requires fewer tables
- Fewer tables imply more rows in the tables - increase in lookup time (doesn't matter too much for objects as we don't have too many of these)
- More flexibility in implicit modelling but less robust (compare application-side constraints and database-side constraints)
I would maybe add a point that implicit typing is conceptually simpler and requires fewer choices about where to put things or look for them, but runs a larger risk of users abusing the flexibility and making a mess of their objects table by having very confusing relations between them (you could insert a relation of "is instance of" between two things neither of which is a type, for instance). This is maybe more an expansion of your last point than an entirely separate point.
On a broader point, I feel like we are reaching a stage where we are not introducing many new ideas any more. I think the next major step might need to be someone sitting down, making a preliminary decision on how to do this (see list of questions above), starting to implement it, and in the process of implementing remaining open to changing those preliminary decisions. I would start with implementing the backend API functions that have all the database queries in them; it'll be in writing those that one will start to see whether the chosen schema is elegant or annoying.
It might be a few weeks off until Edwin or I have time for this. If I were to start implementing this today, I would go with the following choices (a sketch of the resulting schema follows the list below), but I'll keep thinking about this in the background.
- Will the data table have a) a single FK reference to the objects table, or b) two references to source object and data type object?
b) Two foreign key columns pointing to the objects table (see question 4 below).
- Will there be a many-to-many relation table between data and object tables, or just a foreign key column in data?
Single foreign key column. Proxy objects in the `objects` table if need be, but hopefully rarely needed.
- Will types be in a separate table from objects?
No, in the same table.
- Will data types be in a separate table from objects and object types?
No, in the same table.
- Will objects refer to their types using a foreign key column or through a many-to-many relation table?
Foreign key column. Makes many queries a bit simpler, but partially hard-codes the "is instance of" relation in the schema, by giving it a special status where it is not in the `relations` table like everything else.
- Will constraints be in a JSON column or in a separate constraints table?
This one I'm actually still very unsure of; I haven't thought about it much. We should sketch out what the data format for the constraints table/data would be like.
- If object types and/or data types are in a separate table, will they reference each other using many-to-many relation table or using JSON columns?
N/A.
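For concreteness, a rough sketch of what the schema could look like under these preliminary choices; all names and types are placeholders, and the constraints question (6) is left open.

```sql
CREATE TABLE object (
    object_id SERIAL PRIMARY KEY,
    -- "is instance of" partially hard-coded as a self-referencing foreign key;
    -- NULL for objects that act as types themselves.
    type_id INTEGER REFERENCES object (object_id),
    name TEXT NOT NULL,
    metadata JSONB
);

CREATE TABLE relation_type (
    relation_type_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE relation (
    object_1_id INTEGER NOT NULL REFERENCES object (object_id),
    relation_type_id INTEGER NOT NULL REFERENCES relation_type (relation_type_id),
    object_2_id INTEGER NOT NULL REFERENCES object (object_id),
    metadata JSONB
);

CREATE TABLE data (
    datum_id SERIAL PRIMARY KEY,
    source_object_id INTEGER NOT NULL REFERENCES object (object_id),
    data_type_object_id INTEGER NOT NULL REFERENCES object (object_id),
    timestamp TIMESTAMP NOT NULL,
    value JSONB
);
```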
Quick recap of 'instructions' stuff: https://docs.google.com/presentation/d/133_N9WnoWd-QTDDCxJVhZLhqVcWJqj6KvrL41OcLnr0/edit?usp=sharing
The current status of this is that we've converged on a design, with some details still to be settled. That design is captured in the messages above. @vijja-w made a start on implementing the new database design, but I think he didn't get very far before other things took his time, and otherwise development has not progressed for a couple of months now.
Once someone gets back to this, they should read the above discussion to understand the new design and its purpose, and then implement the necessary tables and CRUD functions in the backend. Various aspects of the backend (sensors, locations, services, etc.) can then be ported over to the new database structure one by one, keeping both the old-style tables and the new-style tables in parallel. Hopefully during this process it will become obvious what the right choices are for some of the open questions that @vijja-w and I kept going back and forth on above. Once the new database design is in place and all features use it, the old tables can be dropped, though probably some sort of automated migration pathway should be provided for moving existing data from one design to the other.
A place for us to collect our discussion on how to possibly change the database schema to be more flexible.