mhauru opened this issue 1 year ago
Pat's notes from our latest chat: https://docs.google.com/document/d/1o8vGaGX6xcMTGS6ndvYF47RfMEeq6Hp6bQqKB0k-KFw
One of my thoughts, with Pat's questions and my answers:
I was thinking more about this database design. Here's one idea of how it might work, adapting Pat's design, with four tables:

Table: Object (this one is both object and template in Pat's language)
- Columns: name, arbitrary JSON metadata
- Examples: the concept of temperature, the concept of temperature sensor, temperature sensor #15, the concept of location, location at coordinates (1, 5), task to do a thing by Friday

Table: Relation type
- Columns: name, JSON metadata including a schema for metadata that relations of this type must have
- Examples: "is instance of", "sensor is at location", "sensor type records measure"

Table: Relation
- Columns: relation type id, object 1 id, object 2 id, JSON metadata that conforms to the schema of the type
- Examples: "temperature sensor #15 is an instance of the concept of temperature sensor", "temperature sensor records temperature measurements", "sensor #15 is at location #23"

Table: Data
- Columns: object id, timestamp, JSON data. Maybe additional columns for raw floats, ints and bools if it helps (performance?), otherwise all data in the JSON column.
- Examples: "reading from sensor #15 at datetime blahblah is 15.2"

Why did I fold both objects and templates into a single table? Because once I started thinking of relations I quickly realised I want relations between templates too. I think the object, relation type, and relation tables together form an ontology or a knowledge graph or a semantic web (I'm unclear on the terminology here). A known, common data structure.

Why not fold the data into the object table as well? Because I suspect that would make queries very slow. The point of the data table is that the vast masses of data go in that one; the object table should probably only ever have 10-1000 rows, data can have millions.

I'm not quite sure why I want everything to have a JSON column for "metadata". It's kinda a contingency plan, because you can dump any extra stuff that doesn't fit nicely anywhere else there. Maybe they are not needed for e.g. relation or relation type.

Something that this doesn't do well yet is that I don't know how to associate several objects with a data point (how do I say that a reading is from sensor #15 and of measure "temperature"), nor data points with each other (maybe there are several values coming out of a single sensor reading, and they should be linked). You could always work around this by putting more stuff in the JSON field (you can put anything into JSON...), but that doesn't seem like a very elegant or efficient solution. I'm also considering a concept of data batches, either to be put in the objects table or in a separate table, but that doesn't seem infinitely elegant either.

On a related note, I can think of the following three criteria for any database design we consider:
1. Easy to write queries for.
2. Fast to execute queries.
3. Flexible enough to accommodate any needs that may arise.

Am I missing some fourth criterion, or is that it?
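To make the example rows above concrete, here is a purely illustrative sketch of how they might be inserted, assuming tables shaped roughly as described; all table names, column names, and ids here are placeholders rather than a settled design.

```sql
-- Hypothetical illustration only: "sensor #15 is at location #23", plus one reading.
INSERT INTO object (object_id, name, metadata) VALUES
    (15, 'temperature sensor #15', '{}'),
    (23, 'location at coordinates (1, 5)', '{}');
INSERT INTO relation_type (relation_type_id, name, metadata) VALUES
    (3, 'sensor is at location', '{"schema": {}}');
INSERT INTO relation (relation_type_id, object_1_id, object_2_id, metadata) VALUES
    (3, 15, 23, '{}');
INSERT INTO data (object_id, timestamp, value) VALUES
    (15, '2024-01-05 12:00:00', '{"temperature": 15.2}');
```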
I went through Markus' idea yesterday and I have some questions and proposals. Let's discuss them on Friday, but it would be good if you guys had a look through it if you have time.

Comment: I like that there is this 'ontology that emerges'. I like that this allows for relations between objects.

Questions:
- How would the look-up through the relation table look? Say I have object1 related to object2 (stored in the objects table). If I want to ask: given object1, what is the other object associated with it? Would I have to look through the 'relations' table to find a relation that has object1 in it, then see that it is also attached to object2? It might be better somehow to include the relation in the object table itself, so you could look up object1 and the relation to object2 is already given.
- How would constraints be applied? Previously 'templates' constrained 'objects'. E.g. metadata of 'DHT22' must be of the form '{temperature: float, humidity: float}', i.e. only 'data' with the name 'temperature' and 'humidity' can be associated with objects of type 'DHT22', etc. If objects can be related to objects, how would we go about specifying this constraint?
- Endpoints. If I have a hierarchy of objects like this: object1 -> object2 -> object3 -> data, how would the endpoint querying experience look? Would I have to query for object1, see it is related to object2, query object2, see it is related to object3, query object3 and find the data associated with it? Compare that to, say, another allowed hierarchy of objects: object4 -> data. The querying experience seems quite different, and making a standard, easy-to-use querying experience might be difficult.

Proposal: I'm not sure what the answers to these questions might look like in your proposal, but I have taken them into account in my new proposal. I have included the idea of relations and ontology from your idea. I also included a way to associate multiple objects with a single datapoint. I only have one 'relation_type' currently but this can be extended simply - let's discuss.
We can talk more on Friday, but I'll write here short answers to two of the questions.

How would the look-up through the relation table look? Say I have object1 related to object2 (stored in the objects table). If I want to ask: given object1, what is the other object associated with it? Would I have to look through the 'relations' table to find a relation that has object1 in it, then see that it is also attached to object2? It might be better somehow to include the relation in the object table itself, so you could look up object1 and the relation to object2 is already given.

Yeah, you would do a query with two joins: object table, joined with the relation table to find all the relations of object1, joined again with the object table to find object2 at the other end of that relation. I don't think this would be a problem, postgres can make these queries very fast, and this is the standard way of implementing a many-to-many relation between two tables (or in this case a table with itself). The problem with going directly from object1 to object2 in the object table is that you would either 1) only have many-to-1 and 1-to-1 relations, if each object row has a column for the object it is related to, or 2) not be able to implement these relations as foreign keys, if you have a column that lists multiple objects object1 is related to. The benefit of using foreign key constraints is that it makes it harder to insert invalid relations (e.g. referencing an object that doesn't exist) and makes many queries easier, because postgres is very well equipped to follow foreign key relations and join data over them, and can automatically do things like deleting all sensors related to a sensor type when you delete that type, etc.

How would constraints be applied? Previously 'templates' constrained 'objects'. E.g. metadata of 'DHT22' must be of the form '{temperature: float, humidity: float}', i.e. only 'data' with the name 'temperature' and 'humidity' can be associated with objects of type 'DHT22', etc. If objects can be related to objects, how would we go about specifying this constraint?

Relation types could be marked as either enforcing or not enforcing constraints. So for instance, you would define a relation type called "is a template for", and then the general concept of DHT22 sensor would be related to DHT22 sensor #23 by that relation. And then you would say that "is a template for" is a relation for which object2 (the instance) has to have its JSON metadata follow a structure specified in the JSON metadata of object1 (the template). These constraints would still have to be enforced in application level code (or maybe postgresql triggers), just like they would if template and object were separate tables. Note also that you could specify constraints of other kinds that don't relate to the metadata. For instance, here is a constraint you might imagine wanting: Any object X that has a relation of type 'is instance of' to the object 'concept of sensor' (i.e. X is a sensor) must also have a relation of type 'is located at' to some object Y, that in turn has a relation of type 'is instance of' to the object 'concept of location' (i.e. Y is a location). In other words, every sensor must have a location. Not sure how to write these sorts of constraints in the JSON metadata of one of the tables (not even sure which table I would try to put them in). It might be doable, but it might also be that we shouldn't try to specify constraints in JSON, but somewhere outside the database.
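Going back to the first question, here's a rough sketch (untested, with column names assumed rather than settled) of what the two-join lookup could look like:

```sql
-- Find every object related to object1, going object -> relation -> object.
SELECT o2.*
FROM object AS o1
JOIN relation AS r ON r.object_1_id = o1.object_id
JOIN object AS o2 ON r.object_2_id = o2.object_id
WHERE o1.name = 'object1';
```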
The question about endpoints is a big (and good) one, I'll have to think about this and we can talk on Friday. The question that most bothers me about endpoints: If we make a really general and flexible database design, then any user needs can be accommodated by it. No user will want to think in super abstract terms though, so then we have to design nice and understandable API end points that hide the abstract nature of the database design. But if we want to make the API end points nice and understandable, then they won't be as general and flexible any more. So then if someone wants to add new functionality, they would need to write new API end points, even if they wouldn't have to write new database code. But if, to add more functionality to the platform, you have to go write API code, then is that really much easier than also having to write database code? In which case, what have we gained by this extreme generality and flexibility of the database design?
I would like to start a list of scenarios that any database schema design should be able to serve. I'll list them in this comment. If you want to add new ones/make edits, leave a comment and I'll edit this message. That way we have a single place that lists all the scenarios.
The point of these scenarios is not to be exhaustive of the kinds of things we would like to do, nor to be fully realistic. The point is to stretch any proposed schema and see if it can accommodate different types of needs.
Ideally I would like to work out, for every schema proposal we are seriously considering and for every scenario, the answers to the following questions:
Optimally the answers to question 2 would be literal SQL queries. I've written down a lot of possible queries, so we probably don't want to write actual SQL for all of them, but I would like to spell out at least the trickiest ones very explicitly; it often exposes details I wouldn't think of otherwise.
Some unfinished sketches for scenarios, to be fleshed out in more detail.
Slides from today: https://docs.google.com/presentation/d/1bz3JLZMXxjPN6eCQBD3244kBYLuDeD5OHpVOhKk3_s8/edit?usp=sharing
Visualisation of a minimal example for explicit typing and implicit typing.
I wrote down what the data structures would look like for the first 5 scenarios in the minimalistic schema I propose above.
Tables, with their columns and column data types:
`object`:
- `object_id` (int)
- `name` (str)
- `metadata` (jsonb)

`relation_type`:
- `relation_type_id` (int)
- `name` (str)

`relation`:
- `object_1_id` (foreign key from `object.object_id`)
- `relation_type_id` (foreign key from `relation_type.relation_type_id`)
- `object_2_id` (foreign key from `object.object_id`)
- `metadata` (jsonb)

`data`:
- `datum_id` (int)
- `object_id` (foreign key to `object.object_id`)
- `timestamp` (timestamp)
- `value` (jsonb)

The first three tables, `object`, `relation_type`, and `relation`, form a directed graph. Objects are the nodes, relations are the edges, and relation types are edge labels. This can be seen as a knowledge graph. The `data` table holds the bulk of the data, all the raw readings. Each datum is associated with an object and a timestamp.
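For reference, the same minimal schema written out as PostgreSQL DDL; this is just a sketch, and the exact types and constraints (NOT NULLs, indexes, etc.) are still open.

```sql
CREATE TABLE object (
    object_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    metadata JSONB
);

CREATE TABLE relation_type (
    relation_type_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE relation (
    object_1_id INTEGER NOT NULL REFERENCES object (object_id),
    relation_type_id INTEGER NOT NULL REFERENCES relation_type (relation_type_id),
    object_2_id INTEGER NOT NULL REFERENCES object (object_id),
    metadata JSONB
);

CREATE TABLE data (
    datum_id SERIAL PRIMARY KEY,
    object_id INTEGER NOT NULL REFERENCES object (object_id),
    timestamp TIMESTAMP NOT NULL,
    value JSONB
);
```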
Because the first three tables are nothing but a labeled, directed graph, with both nodes (objects) and edges (relations) able to carry arbitrary json data on them, below I will describe how data would be stored in those tables by literally drawing a graph.
In the drawings all blue boxes are rows in the `object` table. Black lines with arrows are rows in the `relation` table; the arrow points from `object_1_id` to `object_2_id`, and the text next to the arrow is the name of the relation type. The contents of the `relation_type` table are left implicit; they are just all the arrow labels that appear in the diagrams. Red boxes are rows in the `data` table, and the arrowless red lines lead to the object this datum belongs to (the `data.object_id` column). If either an object or a relation has JSON data in the `metadata` column, then I've written that below the name in curly braces {}.
See the comment above for descriptions of the scenarios. I didn't always capture the full scenario, just enough to show that I easily could, to keep the graphs simpler. After scenario 1 I got sick of typing out "is instance of", and I started abbreviating it as "i.i.o.".
Note how the moving of sensors from one location to another is captured in the "from" timestamps in the "located at" relations, stored in the `relation.metadata` JSON column.
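Purely as an illustration (the ids and the "from" key are made up, not part of the proposal), such a relation row might be inserted like this:

```sql
-- Hypothetical "located at" relation; the metadata records when the sensor
-- arrived at that location.
INSERT INTO relation (object_1_id, relation_type_id, object_2_id, metadata)
VALUES (15, 4, 23, '{"from": "2023-03-01T00:00:00"}');
```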
I captured only the GES model, because it's more complicated.
I capture here what the data would look like for a single batch. Note that this one doesn't use the `data` table at all, because the things to be captured require too much linkage between objects for that to work. Adding a new batch creates one new row in `object` and a handful of rows in `relation`, one for each event like sowing and harvesting.
Here the combination of type of warning and source of warning, e.g. "sensor 15 isn't reporting data", is an object, and every time such a warning is raised that creates a new row in `data` referencing that object.
This one is pretty trivial, but that may be because we haven't fleshed the scenario out much. There may be complications that we don't anticipate, since we've never worked with this sort of data.
Each row in the `data` table captures when something has changed. Here the lighting schedule was set once, so there's only one row in `data` referencing that. The lights have been toggled three times. (There's a typo in one of the timestamps, sorry.)
Note that this is still missing all the stuff about what querying these data structures would look like. I just didn't have time to do that yet.
Note also that these are not the only ways to fit these scenarios into this schema; they are just the best ones I could think of.
The one aspect of the schema that I most hesitated on was whether the `data` table should, rather than a single `object_id` column, have two columns called `source_object_id` and `data_type_object_id`, both of which would foreign-key reference the `object` table. This would avoid a couple of complications, e.g. I wouldn't need the "Temperature in Celsius recorded by TS1" object, or the "GES run # 15 temperature lower bound" object. In some other instances, most notably the operational parameters scenario, I wouldn't have a use for both columns though. Still a bit undecided on this, but the change is easy to make if we want to.
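For comparison, a sketch of what that alternative `data` table could look like, using the column names floated above (everything else is assumed):

```sql
CREATE TABLE data (
    datum_id SERIAL PRIMARY KEY,
    -- e.g. the sensor or model run the datum came from
    source_object_id INTEGER NOT NULL REFERENCES object (object_id),
    -- e.g. the quantity or measure the datum is a value of
    data_type_object_id INTEGER NOT NULL REFERENCES object (object_id),
    timestamp TIMESTAMP NOT NULL,
    value JSONB
);
```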
Given that this is just a directed labeled graph with some metadata and foreign key links to the `data` table, I looked a bit into graph databases. They are databases custom-built for storing graph data, distinct from relational databases that store tabular data.
Pros:
Cons:
- The `data` table would still need to reside in a SQL database, and would need to reference objects in the graph database.

The current market leader in graph databases seems to be Neo4j, which unfortunately is proprietary software. There are many graph query languages; I didn't understand the differences. There are also some extensions for PostgreSQL that do graph stuff, but the ones I looked at either seemed immature or didn't do what we need.
I'm currently not excited about using a graph database (see cons above), but we should keep in mind that they are an option, especially if our graph grows bigger than anticipated and performance becomes an issue. Looking at the above scenarios, I would expect at most a few hundred new nodes in the graph (rows in `object`) per day, maybe a bit more for a bigger farm with lots of batches and simulation runs. This would result in something like ~100k rows a year, and if we partition old data into separate tables, that sounds manageable to me.
Slides from today: https://docs.google.com/presentation/d/1D-qgTUPa-O0o1NYH_yPjNa_5qQokyBGqoCsS9_6Yqcc/edit?usp=sharing
Comparison of implicit and explicit typing for Scenario 1.
Overview + explicit typing/implicit typing. Not complete.
https://docs.google.com/document/d/13q6m0Mnn4_DO7da5SozNn3UB8Te2ODpVm479gd-vBPU/edit?usp=sharing
By the way, turns out there's a Turing interest group on knowledge graphs:
As proposed by @rmward61, we should add a
Like the usual time series data, but data might be
We might also want to access the old versions even after the fixes come in.
Should probably talk about this with Rothamstead folks, to make sure we capture all the possible things that might happen under this heading.
I just read @vijja-w's notes, below are some comments on that and on questions that came up in our meeting on Tuesday. I'm abbreviating implicit typing as IT and explicit typing as ET.
I don't understand how splitting into multiple relation tables would work in IT. Any distinction between e.g. what is an object type and what is a data type is in the data (in the relations it has), not in the database schema, so I wouldn't know which table to write a relation to.
IT, as proposed, would not have data-data or data-object relation tables. Those are the ones that might get really big. The only relation table in IT would be object-object, which would also cover what in ET would be object type - object type, object type - data type, data type - object type, and data type - data type (which I think should be tables to make use of foreign keys). The object-object table in IT would never get very large, so I don't think there would be any need to split it into subtables. Doing so would complicate queries, because you would have to hard code which relation table something gets written into.
We do not have ‘object_type’ or ‘data_type’ so doing things like ‘filtering by object_type’ is not natural under implicit typing.
I would disagree on the naturality claim. Here's how I imagine the queries would look for getting all sensors that are of type "temperature sensor". (I haven't actually tested if my SQL syntax here is valid, but it's close.)
In ET:
```sql
SELECT * FROM object
JOIN object_type ON object.type_id = object_type.id
WHERE object_type.name = 'temperature sensor'
```
In IT:
```sql
SELECT * FROM object o1
JOIN relation ON relation.object_1_id = o1.id
JOIN object o2 ON relation.object_2_id = o2.id
WHERE o2.name = 'temperature sensor'
AND relation.relation_type = 'is instance of'
```
That's a tiny bit more complicated in IT, but I wouldn't say it's unnatural or hard to follow. If you want to query instead for all sensors of any sensor type, the queries I think would look like this:
In IT:
```sql
SELECT * FROM object o1
JOIN relation r1 ON r1.object_1_id = o1.id
JOIN relation r2 ON r2.object_1_id = r1.object_2_id
JOIN object o2 ON r2.object_2_id = o2.id
WHERE o2.name = 'sensor'
AND r1.relation_type = 'is instance of'
AND r2.relation_type = 'is subtype of'
```
So that's saying "pick all objects that are instances of a type that is in turn a subtype of 'sensor'". The query is very similar in ET:
```sql
SELECT * FROM object
JOIN object_type ot1 ON object.type_id = ot1.id
JOIN object_type_object_type_relation ototr ON ototr.object_type_1_id = ot1.id
JOIN object_type ot2 ON ototr.object_type_2_id = ot2.id
WHERE ot2.name = 'sensor'
AND ototr.relation_type = 'is subtype of'
```
That's assuming we have an `object_type_object_type_relation` table. If we don't, and rather things like one type being a subtype of another are stored in a JSON column, then I don't think there's any way to do this in a single query.
We also have the added benefit of having all queries be faster (looking through smaller ‘object_type’ and ‘object tables’, as opposed to the much bigger ‘relations’ table)
I think the speed difference here would be negligible, because `relations` would still be something like 100-10,000 rows, which is quite trivial for PostgreSQL, especially since this table is very easy to index efficiently.
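As an illustration of that last point (index names arbitrary, untested):

```sql
-- Indexes like these would keep relation look-ups cheap even if the table
-- grows to tens of thousands of rows.
CREATE INDEX relation_object_1_idx ON relation (object_1_id);
CREATE INDEX relation_object_2_idx ON relation (object_2_id);
```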
- Forcing the existence of ‘is_instance_of’, is forcing the existence of an ‘object_type’ - ‘instance’ hierarchy.
- Why not just model this explicitly and not have to force anything?
This is a very good question, and I think it gets to the heart of the matter. If `object_type` - instance, aka "is instance of", was the only relation to model, I think we should just have them be separate tables. But I think we will also need many other kinds of relations, such as "is subtype of", "is located at", and "is instance of" where the instance itself is a type, which further has its own instances. Once we have a multitude of such relation types, the question to me turns the other way around: since we already need a way of keeping track of generic relations of all kinds, why take one of those relations, "is instance of", and treat it differently, when we could just as well use the same system for all relations? I'll comment on this more below in the pros/cons section.
How would you look up the constraints to use [in IT]?
Very much the same way as in ET: you look up the relevant type, which you can find in the `object` table, and the body of it holds the constraints that any instance of this type must fulfill. The only difference is which table you look into, and whether something being a type and something else being an instance of a type is marked by them being in separate tables with a foreign key link, or by them having a relation of "is instance of".
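A sketch (untested, using the minimal schema's column names) of what that look-up could be in IT:

```sql
-- Fetch the metadata of the type object for a given instance; the constraints
-- on instances would live in that metadata. The instance name is an example.
SELECT type_obj.metadata
FROM object AS inst
JOIN relation AS r ON r.object_1_id = inst.object_id
JOIN relation_type AS rt ON rt.relation_type_id = r.relation_type_id
JOIN object AS type_obj ON type_obj.object_id = r.object_2_id
WHERE inst.name = 'temperature sensor #15'
  AND rt.name = 'is instance of';
```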
- …`JOIN` in the SQL query. There may also be a performance benefit to this, though I suspect it will be very small, since all the object/type stuff is very small compared to the data table, which is where the bulk of the query will be.
- …`object_type` and `data_type` with each other and within themselves could be confusing. I can easily imagine looking for a relation in the wrong table. Some queries would also get more complex, because you would need to remember to check multiple tables.
- …`type`.

The last two points are the ones that weigh the most for me: since we aim for something very flexible, I worry that things will come up where the more rigid structure of object types, data types, and objects is not a natural fit. This could happen as we develop new features, or when a user tries to independently use DTBase for something we did not plan for. Whereas the very simple structure of everything being made of objects and relations between them is quite universal, and you can fit a lot of use cases into its mold without having to hackily bend things out of shape.
Thanks for this @mhauru, really appreciate you going through this in detail and giving such a detailed response. I have read through your response. Let's chat when I get back (although like I said I am happy to hop on calls here and there if required). Having said that here is my comment:
In the IT schema (as it currently is) there exists:
I think point 2 is not great, since we cannot have one datum associated with many objects; this would be useful for things like reservations: a reservation (data) being associated with a Slot (object) and a Batch (object).
This can be side-stepped using 'proxy objects', where you define an object called 'reservations-belonging-to-slot-1-and-batch-1' and then associate data with these objects.
This is, however, also problematic, as the number of these proxy objects scales multiplicatively.
A data-object relation table solves this without having to do hacky things like the above.
The existence of the object-data and data-data tables is an optional bonus which I think will help us cover edge cases, but that is beyond the scope of this point.
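To spell out the proxy-object workaround referred to above (all names purely hypothetical):

```sql
-- One proxy object per (slot, batch) combination, which is why the number of
-- proxy objects scales multiplicatively. Reservations would then be rows in
-- the data table referencing this proxy object.
INSERT INTO object (name, metadata)
VALUES ('reservations-belonging-to-slot-1-and-batch-1', '{}');
```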
You are absolutely right, I think the term ‘unnatural’ is overkill (sorry).
I’m not sure why we need JOINs (I’m not fluent with JOINs). Would the following not be sufficient?
In ET, given the object_type with
SELECT * FROM object WHERE object_type_id =
Very much the same way as in ET: You look up the relevant type, which you can find in the object table, and the body of it holds the constraints that any instance of this type must fulfill.
This is for constraints on the body, which I agree with and understand. A different kind of constraint I am unsure about is constraints on relations.
Two examples where this is relevant:
In ET the constraints would look like this:
(1)
(2)
Note it is very easy to write code to validate these constraints too: just loop through the ids in 'related_data' and 'related_object' and check that the insertion request has all of them.
What does the above procedure look like in IT?
I do not think we should have four tables for:
We should fold it up as dictionaries with foreign keys in them as suggested in the current version of ET.
We are only using these to check for constraints after all (see *** above)
Please note that ET also has relation types, and so far I have been able to use only one relation type to model all the scenarios. I think this shows how flexible the schema is (IT and ET): we don't actually need to define that many different relation types to be able to capture the data structure; most of the heavy lifting is done once we define the objects/object_types.
The only difference between ET and IT is that ET has two more tables: object_type and data_type. (All the four-relation-table stuff doesn't actually need to be there in either IT or ET; I just think it's a nice thing to have.)
Having said this, I think IT is actually a subset of ET. Consider the following: if we simply leave the object_type and data_type tables in ET empty, we would get IT.
ET just adds ‘typing’ tables which allow us to group objects and data in semantically meaningful ways, allowing for simple look ups, sorting and filtering. ET also has the added benefit that relational constraints are given to you immediately, based on how object_types and data_types are linked together.
The current IT and ET schemas as proposed have many differences, e.g. four relation tables vs one relation table + data-foreignkey-object. These are not fundamental to ET or IT; they are just what was proposed under each schema, and both can be applied to ET and IT.
The thing that is fundamentally different as a consequence of choosing ET or IT is that ET has two additional typing tables.
Hi @vijja-w, happy new year!
Thanks for the comments, here are some further thoughts on them:
I agree that proxy objects are a bit clumsy, and feel like they add complexity to the schema that could be avoided. With a `data-object` table my main worry is performance, since the queries related to the `data` table are the heavy ones. Note also that the `data-object` table would need to have a column for relation type, to make clear what role different objects serve for a data point, e.g. that one is the source and another is the quantity measured.
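For reference, a sketch of what such a data-object relation table might look like, including the relation-type column; all names here are assumptions.

```sql
CREATE TABLE data_object_relation (
    datum_id INTEGER NOT NULL REFERENCES data (datum_id),
    object_id INTEGER NOT NULL REFERENCES object (object_id),
    -- e.g. "source" vs "quantity measured"
    relation_type_id INTEGER NOT NULL REFERENCES relation_type (relation_type_id)
);
```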
I'll need to give this more thought, and maybe a test of performance would be in order as well. Neither option seems ideal, and I'm not sure what the best balance is.
I would avoid a `data-data` table. It could easily explode in size and kill performance, and I can't think of use cases for it that wouldn't equally well or better be served by `data-object` relations, with proxy objects if need be.
I’m not sure why we need JOINs (I’m not fluent with JOINs). Would the following not be sufficient?
In ET, given the object_type with
SELECT * FROM object WHERE object_type_id =
;
That works if you know the ID; I just wanted to cater for the case where you only know the object type name, which is what the user of the backend API would give us. You could first do one query to find the ID of the type with the given name, and then a second query to get the objects with this ID. With a JOIN we bake these two things into a single query.
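Spelled out, the two-query version would be something like this (the returned id is of course just illustrative):

```sql
-- First find the type's id by name...
SELECT id FROM object_type WHERE name = 'temperature sensor';
-- ...then, supposing that returned 7, fetch the objects of that type:
SELECT * FROM object WHERE type_id = 7;
```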
What does the above procedure look like in IT?
It would be the same except for how we look for the `related_data` and/or `related_object` stuff. You would run a query of "what are all the objects related to this one by a relation of `instance must relate to`" (or whatever we want to call the relation type). The query would look something like this:
```sql
SELECT * FROM object o1
JOIN relation r ON r.object_1_id = o1.id
JOIN object o2 ON r.object_2_id = o2.id
WHERE o2.name = 'tray'
AND r.relation_type = 'instance must relate to'
```
That query would return two objects, "slot" and "grow batch".
I neglected to put these sorts of relations between types in the diagrams I drew above; I may amend that if I find a moment.
…`object_type` and `data_type` in ET

We should fold it up as dictionaries with foreign keys in them as suggested in the current version of ET.
PostgreSQL can't enforce a foreign key relation on multiple values in the same column in a dictionary (or a list, or JSON). So your foreign keys wouldn't be enforced on the database level, and you also couldn't do JOINs over them.
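To illustrate the contrast (table and column names assumed): a separate relation table lets the database enforce the references, whereas ids embedded in JSON cannot carry such a constraint.

```sql
-- With a join table, PostgreSQL enforces that both ends of the relation exist:
CREATE TABLE object_type_relation (
    object_type_1_id INTEGER NOT NULL REFERENCES object_type (id),
    object_type_2_id INTEGER NOT NULL REFERENCES object_type (id)
);
-- With something like a JSONB column on object_type, e.g.
--     related_object_type_ids JSONB  -- '[3, 7, 12]'
-- there is no REFERENCES constraint, and joining over it needs explicit
-- JSONB operators.
```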
We are only using these to check for constraints after all (see *** above)
I don't think that's going to be the only use. Two others that come to mind now are getting various sorts of data, e.g. a query to get all sensors of any sensor type, which relies on the `is subtype of` relation, and documenting relations between objects, e.g. that some model output measure relates to the GES model and not to the Arima model.
Please note that ET also has relation types and so far I have been able to use only one relation type to model all the scenarios.
I think having descriptive names for relation types is good for understandability and documentation. In the diagrams I drew above, I ended up using around 20 different relation types: is instance of, of quantity, from sensor, records quantity, located at, located in section, is subtype of, output measure type for model, parameter type for model, run type for model, run for run output, measure for run output, run parameter, batch sown/planned/moved/harvested (4 relation types), recipe for, of warning type, from source, and schedule for.
It's not absolutely necessary to have that many, but I think it makes the data easier to understand. Use cases we haven't explicitly planned for would probably add their own relation types too.
Enumerating the design decisions to be made:
1) Will the data table have a) a single FK reference to the objects table, or b) two references to source object and data type object?
2) Will there be a many-to-many relation table between data and object tables, or just a foreign key column in data?
3) Will types be in a separate table from objects?
4) Will data types be in a separate table from objects and object types?
5) Will objects refer to their types using a foreign key column or through a many-to-many relation table?
6) Will constraints be in a JSON column or in a separate constraints table?
7) If object types and/or data types are in a separate table, will they reference each other using a many-to-many relation table or using JSON columns?
Note that while many of these seem related to each other, they are in fact mostly independent, with the exception of the last one.
@vijja-w did I forget something?
I put together the design choices using a 'full' schematic that models a lot of the concepts that we've discussed. We can then switch out components or remove them as required. Please have a look @mhauru https://docs.google.com/document/d/1VnbgMnvD_FQK6mb-_wVPW-13HlHIn7rJgIN8zK87FDc/edit?usp=sharing
Thanks Pat, I especially like this summary:
Consideration
- Implicit modelling of these concepts (proxy, tags, etc.) requires fewer tables
- Fewer tables imply more rows in the tables - increase in lookup time (doesn't matter too much for objects as we don't have too many of these)
- More flexibility in implicit modelling but less robust (compare application-side constraints and database-side constraints)
I would maybe add a point that implicit typing is conceptually simpler and requires fewer choices about where to put things or look for them, but runs a larger risk of users abusing the flexibility and making a mess of their objects table by having very confusing relations between them (you could insert a relation of "is instance of" between two things neither of which is a type, for instance). This is maybe more an expansion of your last point than an entirely separate point.
On a broader point, I feel like we are reaching a stage where we are not introducing many new ideas any more. I think the next major step might need to be someone sitting down, making a preliminary decision on how to do this (see list of questions above), starting to implement it, and in the process of implementing remaining open to changing those preliminary decisions. I would start with implementing the backend API functions that have all the database queries in them; it'll be in writing those that one will start to see whether the chosen schema is elegant or annoying.
It might be a few weeks off until Edwin or I have time for this. If I were to start implementing this today, I would go with the following choices (a sketch of the resulting schema follows the list below), but I'll keep thinking about this in the background.
- Will the data table have a) a single FK reference to the objects table, or b) two references to source object and data type object?
b) Two foreign key columns pointing to the objects table (see question 4 below).
- Will there be a many-to-many relation table between data and object tables, or just a foreign key column in data?
Single foreign key column. Proxy objects in the `objects` table if need be, but hopefully rarely needed.
- Will types be in a separate table from objects?
No, in the same table.
- Will data types be in a separate table from objects and object types?
No, in the same table.
- Will objects refer to their types using a foreign key column or through a many-to-many relation table?
Foreign key column. Makes many queries a bit simpler, but partially hard-codes the "is instance of" relation in the schema, by giving it a special status where it is not in the `relations` table like everything else.
- Will constraints be in a JSON column or in a separate constraints table?
This one I'm actually still very unsure of; I haven't thought about it much. We should sketch out what the data format for the constraints table/data would be like.
- If object types and/or data types are in a separate table, will they reference each other using many-to-many relation table or using JSON columns?
N/A.
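For concreteness, a rough sketch of what the schema could look like under these preliminary choices; all names and types are placeholders, and the constraints question (6) is left open.

```sql
CREATE TABLE object (
    object_id SERIAL PRIMARY KEY,
    -- "is instance of" partially hard-coded as a self-referencing foreign key;
    -- NULL for objects that act as types themselves.
    type_id INTEGER REFERENCES object (object_id),
    name TEXT NOT NULL,
    metadata JSONB
);

CREATE TABLE relation_type (
    relation_type_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL
);

CREATE TABLE relation (
    object_1_id INTEGER NOT NULL REFERENCES object (object_id),
    relation_type_id INTEGER NOT NULL REFERENCES relation_type (relation_type_id),
    object_2_id INTEGER NOT NULL REFERENCES object (object_id),
    metadata JSONB
);

CREATE TABLE data (
    datum_id SERIAL PRIMARY KEY,
    source_object_id INTEGER NOT NULL REFERENCES object (object_id),
    data_type_object_id INTEGER NOT NULL REFERENCES object (object_id),
    timestamp TIMESTAMP NOT NULL,
    value JSONB
);
```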
Quick recap of 'instructions' stuff: https://docs.google.com/presentation/d/133_N9WnoWd-QTDDCxJVhZLhqVcWJqj6KvrL41OcLnr0/edit?usp=sharing
The current status of this is that we've converged on a design, with some details still to be settled. That design is captured in the messages above. @vijja-w made a start on implementing the new database design, but I think he didn't get very far before other things took his time, and otherwise development has not progressed for a couple of months now.
Once someone gets back to this, they should read the above discussion to understand the new design and its purpose, and then implement the necessary tables and CRUD functions in the backend. Various aspects of the backend (sensors, locations, services, etc.) can then be ported over to the new database structure one by one, keeping both the old-style tables and the new-style tables in parallel. Hopefully during this process it will become obvious what the right choices are for some of the open questions that @vijja-w and I kept going back and forth on above. Once the new database design is in place and all features use it, the old tables can be dropped, though probably some sort of automated migration pathway should be provided for moving existing data from one design to the other.
A place for us to collect our discussion on how to possibly change the database schema to be more flexible.