Closed rnllv closed 3 years ago
Thanks @arunlal-v. Interesting question - and now is definitely the right time to consider it (before the users have a load of Task entries). I'm not 100% sure that switching how we store Tasks is the right thing to do, but it has also raised a number of interesting relevant questions in my mind, so it's definitely worth thinking about. I've had a few thoughts which I'll try and jot down sometime - but this week I'm involved in a conference full-time, so won't have much time to work on this. I'll be able to provide a full response next week.
The article was an interesting read @arunlal-v
It's important that @Robin & I understand the idea. I think the change would extend to Platforms and Participants.
We have these concepts for Tasks:
TaskGroups
It may be a good idea to create a SQLFiddle-playground demo table, showing the new structure, some sample data, plus how we would run queries against it.
Here's a first demo: http://sqlfiddle.com/#!17/3d6e9/5
Let me know if any other scenarios are to be included.
P.S.: The queries demonstrated in the fiddle can be achieved, even though a little cluttered, using Adjacency list table design as well.
Sure @robintw.
To illustrate point 3 mentioned the original post, consider the below Tasks table with Adjacency design and "ON DELETE CASCADE" specified on the foreign key.
postgres=# select * from tasks;
task_id | parent_task_id | name | start
---------+----------------+-------+----------------------------------
1 | | Task1 | 2021-03-30 23:24:33.990656+05:30
2 | 1 | Task2 | 2021-03-30 23:24:34.01517+05:30
3 | 2 | Task3 | 2021-03-30 23:24:34.039694+05:30
4 | | Task4 | 2021-03-25 23:24:34.064555+05:30
5 | 4 | Task5 | 2021-03-25 23:24:34.089038+05:30
6 | 5 | Task6 | 2021-03-25 23:24:34.113507+05:30
(6 rows)
postgres=# delete from tasks where task_id=1;
DELETE 1
postgres=# select * from tasks;
task_id | parent_task_id | name | start
---------+----------------+-------+----------------------------------
4 | | Task4 | 2021-03-25 23:24:34.064555+05:30
5 | 4 | Task5 | 2021-03-25 23:24:34.089038+05:30
6 | 5 | Task6 | 2021-03-25 23:24:34.113507+05:30
(3 rows)
As can be seen, the "ON DELETE CASCADE" action defined on adjacency list Tasks table deletes task 2 and task 3 definitions. This means, we have to give up either the ability to hold task definitions without it being in any relation ( and preventing us to reuse tasks in another hierarchy) or the ability to use "ON DELETE CASCADE" action that helps clean up orphaned records automatically.
A Tasks table with Closure table will lose only the specific task definition and the Task_Relationship table will be cleared of any records referencing the deleted task in ancestor or descendent columns.
Based on analysis of data in pepys.Tasks table, the details shared in issue 840 and in this thread, and my discussion with @IanMayo, I think we can remodel the Tasks and its related tables without recursion.
Consider this design: http://sqlfiddle.com/#!17/876ed3/1
I believe this closely resembles real life war game series. The Tasks table in the new design can be enhanced to include iteration details of the war game series (or the iteration can be arrived upon based on calculated rank after sorting all tasks of a given task_group by year), but those are finer points that we can consider later.
This design can also handle ad-hoc war games that do not belong to any series, like a joint US-Japan war game in response to an aggressor in the South China Sea.
This design eliminates
One advantage of this design is that task_group.name and task.name can be directly used as Level 1 and Level 2 filters respectively from UI.
@IanMayo, @robintw, please let me know if there are any scenarios that cannot be handled by the new design.
Thanks @arunlal-v,
couple of items of feedback:
1) We can't use the term task_group
, since it already has meaning in my user domain https://en.wikipedia.org/wiki/Task_Group
2) I'm sure the users need the two levels of tasks. Let me mock structure:
090
110
, to allow for others to be created between it and 130
090
090
110
, to allow for others to be created between it and 130
As you'll see, we can't put all of those J05090
entries under Joint Warrior
. The indexes are only unique within that particular wargame (JW 20/01
).
Let me introduce a naming convention for the tiers:
The presence of JW 20/01
also helps create/maintain the lower level. The analysts receive a document listing all the ships (participants) in a wargame. They add it to the wargame as participants. Then as they are adding the serials, they only have to select participants from the limited list defined for the wargame.
@arunlal-v - I'd appreciate you updating the SQLFiddle to reflect the above data model.
Considering the review comments, this is how the Tasks module should be designed. This is a long form version of the design. Once this is finalised, we can move into ER Diagrams and SQLFiddles.
"Exercises" should be the central point. (In our example, JW 20/01 is an exercise) An exercise, in Debrief parlance, is comprised of a set of "Serials" (J05090, J05110, etc. are examples of serials) An exercise has a non empty set of "Participants" (ex. Ships, Aircrafts etc.) A Serial in a given exercise involves "Participation" from only the set of participants mapped for that exercise. An exercise may be part of a "Recurring Series" (In DB terms, it will be represented by "Series" and "Series_Exercises" tables)
The following are the advantages of using a normalized structure, like above, instead of a self-referential table with a generic structure to capture disparate attributes.
1) The front end (having different input forms and filters to capture information about different modules like Exercises, Serials ) has to maintain and process the recursion logic each time it has to store any entity data to the table, as each entity is in the same table with different levels of recursion. In case of the normalised design, the front end just has to update specific table to save specific information without any additional logic
2) It helps to extend the application with new functionality, if required. A new feature like Track the next Serials can be based off of just the Exercises/Serials entity (in the new design) and a new events table, if required. This implementation will be complex if the self-referental design of Tasks is used as it has join the entire Tasks table containing information about exercises, Serials, and series. Another new feature like tracking Ground Rules (ex. Warships can only use 'n' number of torpedo simulations, etc.) can be introduced just by creating new tables and referencing them in either Participant/Participation entities (in the new design), instead of referencing from the entire tasks table.
Thanks @arunlal-v - have renamed games
to serials
.
Here's the SQLFiddle: http://sqlfiddle.com/#!17/9dcc4/5
Changed Series_Events to Series_Exercises
Updated SQLFiddle with name of Participants and Participations changed to Exercise_Participants and Serial_Participants respectively.
Please find below updated long form factoring in the name changes
"Exercises" are the central point in Debrief to analyze war games. An exercise, in Debrief parlance, is a logical collection of games or "Serials". An exercise has a set of participants called "Exercise Participants" (ex. Platforms like Ships, Aircrafts etc.) A serial in a given exercise has a set of participants called "Serial Participants" which is always a subset of exercise participants mapped to the given exercise. An exercise may be part of "Series Exercises" which is a logical set of exercises tracked under a "Series".
@IanMayo, as per the above, a platform that takes part in a serial will have more than one entry in "Serial Participants" for a given serial (i.e. more than one serial_participant_id for the same exercise_participant_id, serial_id combo) based on the duration during which it was active. "Serial Participants" also has start and end attributes which is not the case with "Exercise Participants". So should we rename "Serial Participants" to "Serial Participation"?
This is the ER diagram for the new structure. (Privacy fields and other metadata attributes are to be added)
I'm busy with childcare today, and won't be able to look at this properly until Tuesday (Monday is a public holiday in the UK). However, in general I like the new approach. I've got a couple of questions based on the ER diagram though:
series_exercises
table exist? Can't we just have a series_id
field in the exercises
table? That would remove one extra layer of tables, and make it easier for me to query the exercises that are part of each series (particularly as doing queries with lots of joins can get complicated with SQLAlchemy). I thought these sort of 'intermediate tables' were only needed when there was a many-to-many relationship, but (as I understand it) each series can have multiple exercises, but each exercise can only be in one series.(I think I've got another question about exercise_participants
, but I need to have a bit more of a think about that).
No issues @robintw. Enjoy your long weekend :)
each series can have multiple exercises
is handled in the design. each exercise can only be in one series
can be enforced by a unique key constraint on series_exercises.exercise_id. But your suggestion: Can't we just have a series_id field in the exercises table?
is also valid. We'll have one less table to join, but there will be null values for exercise.series_id for exercises that are not part of any series. We can discuss further when you're back.Thanks @arunlal-v.
In response to 1) - yes I think that it would be better to remove the series_exercises
table and just have an exercise.series_id
field which can be null if the exercise isn't associated with a series. That will work best with the way that I'm dealing with database objects in the UI.
My next question is about why we need both serial_participants
and exercise_participants
, and why serial_participants
references exercise_participants
. Could we not just have a participants
table (like we do currently) and reference that from both exercises
and serials
? It would mean we just had a table of participants where the ID was either a serial ID or an exercise ID, and then it referenced the platform etc.
I also don't quite understand what you said above:
a platform that takes part in a serial will have more than one entry in "Serial Participants" for a given serial (i.e. more than one serial_participant_id for the same exercise_participant_id, serial_id combo) based on the duration during which it was active.
Why would there be more than one entry in Serial Participants? Is that just if it is active for various distinct time periods? (eg. 09:00-11:00 and then 15:00-18:00) - I don't know whether this actually happens - @IanMayo, is this something we need to support?
Just as a final comment: it might be useful to extend the SQLFiddle a bit to include some more entries in the data so we can see more easily how this works for various more complex situations. For example, adding an exercise that has no series, adding multiple platforms assigned to different exercises/serials, with different durations etc.
In response to 1) - yes I think that it would be better to remove the series_exercises table and just have an exercise.series_id field which can be null if the exercise isn't associated with a series. That will work best with the way that I'm dealing with database objects in the UI.
Sure @robintw, we can remove series_exercises
table and refer series_id in exercises table. @IanMayo, do you have any comments here?
My next question is about why we need both serial_participants and exercise_participants
They were originally named "*Exercise Participants" and "Serial Participa*tion". The latter was renamed as requested by @IanMayo. I was checking with him if we would still want to go ahead with this change. Referring to the long form design, the "Exercise Participants" contains the set of platforms earmarked for this exercise. When analysts need to mark any platform against a serial, the platform dropdown element (or UI filter) will be populated from this set instead of displaying the entire platforms defined in pepys."Platforms" table. This approach also handles scenario where a platform couldn't take part in any of the serials (due to technical or other issues), but our reports can pull out all earmarked platforms for the given exercise.
Why would there be more than one entry in Serial Participants? Is that just if it is active for various distinct time periods? (eg. 09:00-11:00 and then 15:00-18:00) - I don't know whether this actually happens - @IanMayo, is this something we need to support?
Could see from point 1 in this comment that this might be possible. Anyways, the design can handle multiple participation in a single serial. I was just informing @IanMayo that renaming "Serial Participation" to "Serial Participants" would be confusing for so and so reasons.
Just as a final comment: it might be useful to extend the SQLFiddle a bit to include some more entries in the data so we can see more easily how this works for various more complex situations. For example, adding an exercise that has no series, adding multiple platforms assigned to different exercises/serials, with different durations etc.
Sure thing @robintw, I'll get started on that once the design is a little stable.
Hello @robintw & @arunlal-v
I checked with the users. No, this doesn't happen.
I think the root cause to needing two tables of participation is that we've separated the 3 tiers of Tasks
into Series
, Exercises
and Serials
. Once we separate Exercises
and Serials
we need two different tables to record participations in each one. There are merits to separating the two tables, since more detail is captured at the Serial
level.
A platform first gets assigned to an Exercise (Exercise_Participants) and then to a child serial (Serial_Participants). The presence of a Exercise_Participant
in a Serial_Participant
is how we ensure a serial participant is already present in the exercise. We currently enforce this through Python logic, but the above ensure the logic is enforced in the database, too.
I'm happy for the users to an Other
Series, into which ad-hoc exercises are stored. If it becomes unweildy they have the tools to separate Other
down further - either by year or exercise type. That's their core housekeeping responsibility.
So, we do not need an Exercise Series
table.
The above logic also keeps @robintw 's task table in the UI logical - since every exercise sits within a Series.
Thanks @IanMayo. I'll share the updated design in sometime.
Here's the SQLFiddle with updated design
Here's the ER Diagram.
@IanMayo, do we need privacy_id field added for all these new tables?
Yes, we'll need Privacy
for all of them, please.
@robintw - in the UI, by default we can use the lowest level value in the Privacies
table, and not force the user to select a value.
@IanMayo Do we definitely need a privacy for all of those tables? Or just for Exercises, Series and Serials? Given that Platforms have their own privacy - can the participants tables do without a privacy?
Hello @robintw - I welcome you testing the logic. In this instance there may be a justification.
Let's imagine HMS Queen Elizabeth (QE).
The fact she exists is public knowledge. The fact she is taking part in JW 21/01 is sensitive until it is published by the MOD. So, that means the "Exercise participant" value for QE/JW21/01 is "Private" until the date of the news release when it becomes "Public". Some of the Serials in JW21/01 may be public, since the MoD may have had to send out messages to fishing craft / aircraft about if there is going to be some GPS Jamming (for instance). But, others won't be shared, nor will the details of who is taking part in which serials.
The above is quite a bit of juggling, but it's quite familiar to the analysts.
Your example makes sense - lets go forward with privacies for everything.
Okay @robintw, @IanMayo.
Any other fields to be added?
Here is the schema:
We need environment
and location
for Serials
, we need Force
for Serial_Participants
Sure @IanMayo. Will include these fields. Also, the new default data state in pepys."PlatformTypes" and frequency in pepys."Series" are interval datatypes. We don't have that in SQLite.
How about DefaultDataIntervalSecs
as an integer?
Got it.
Yes, I think just handle it as seconds in an integer field - but most other field names we have are lower-case with dashes separating words, so maybe default_data_interval
or default_data_interval_secs
?
default_data_interval_secs
will be descriptive and preferred SQL way.
To be consistent with the other tables, I'll be going with camelCase naming for table names.
@IanMayo, what should be the datatype for start and end. Should we retain it as "Timestamp without timezone" or follow Postgres recommendation?
While data may come from different timezones, all systems capturing maritime data run at UTC, which effectively makes the TimeZone of no value.
There may be an argument in the future for changing to allow TimeZone, but this issue isn't a sufficient reason to move away from what's followed in the rest of Pepys database.
Fine
Here's the updated sqlFiddle [Privacy field missed. Updating in sometime]
Here's the ER diagram.
Sorry I didn't mention this earlier @arunlal-v - we don't need fruqency_days
in Series
. It doesn't bring any value to the users.
@IanMayo, I'm guessing the answer for my following question is "No". But just wanted to get it confirmed by you.
There will be only one serial happening at any given time for a exercise, right?
And should we consider multiple serials in a day?
There will typically be up to 5 serials per day.
The serials can overlap, though a participant can only be in one serial at a time.
While the "big players" may all do a fast-paced one-hour serial where the aircraft carrier gets attacked by aircraft, some other units may be doing a 6 hour serial that involves a slower paced undertaking such as a commando reconnaissance patrol.
@IanMayo You said:
I'm happy for the users to an Other Series, into which ad-hoc exercises are stored. If it becomes unweildy they have the tools to separate Other down further - either by year or exercise type. That's their core housekeeping responsibility.
I agree, that will be very helpful for me from the UI perspective. Should we create this Other
series as part of the default metadata that can be added when the database is created? Then we can always assume it exists, and add things to it if necessary.
I don't think we need to create the Other
series. The users can create their own top-level series - and add wargames to it.
and add things to it if necessary.
I can only picture Pepys adding wargames to a series by a user first selecting the series (maybe after creating it), and then adding the wargame.
@IanMayo, I believe we'll be needing one or more of the following fields in pepys."Serials". Kindly confirm.
base_identifier text - to capture J in J05324 day_of_month integer - to capture 05 in J05324 serial_index integer - to capture 324 in J05324
In the dashboard picture shared in issue 840, I see the serial name as CASEX 324, where as in the examples you have shared as comments in other replies in the same issue has J05324.
@IanMayo, with reference to this comment, what serial name should the dashboard display if there are multiple serials for the previous day?
@IanMayo, I believe we'll be needing one or more of the following fields in pepys."Serials". Kindly confirm.
No, these serial values are human generated. We don't need to produce or interpret them.
There are two names for a Serial. The J06230
is a human-created value that is used to uniquely refer to each part of the wargame. The CASEX 324
is the name of a pre-scripted exercise contained in a book held by the participants. The CASEX 324
may occur several times across the wargame. So, it's useful for the participant to know what they're going to be doing, but doesn't uniquely identify a serial/segment/component.
That is why both are needed, and why analysts find the CASEX 324
field useful in addition to the exercise.
Here is where @robintw added the Exercise
field:
https://github.com/debrief/pepys-import/issues/836
Based on the above, I think we need @arunlal-v to modify the table design:
serial.name
to serial.serial_number
and introduce an optional exercise
text field.@IanMayo, with reference to this comment, what serial name should the dashboard display if there are multiple serials for the previous day?
The Serial
(J06230) will always be displayed - since they are unique for this wargame. But, analysts will find value in the exercise
label too (CASEX 324) - so we'll display both.
Aah, @arunlal-v - I realise we need another field for SerialParticipants
. The participant will be playing as either Blue Force
(friendly) or Red Force
(enemy). One day we may need to support White Force
(non-combatant).
I think this could be met through a Force Types
table, containing:
This will allow a compulsory Force
field in the SerialParticipants
table.
pepys."Tasks" table has a self-referential design (popularly known as Adjacency list anti-pattern in SQL Designs). Self-referencing table design is considered as a design anti-pattern for the below reasons.
1) For tree like data structure represented by Adjacency list, the operations on the nodes cannot be done independent of the edges. In other words, two logically different functions (DML on Task definitions, and DML on Task relations) rely on the same table.
2) Join operations for tree querying has to process the Task definition each time.
3) Cascading deletes cannot be specified as they might induce a recursive sequence of deletes.
4) Depth of tree has to be hard coded into the query. This issue is invalidated by the SQL-99 standard, which defines recursive query syntax using the WITH keyword followed by a common table expression. And currently, Sqlite and Postgres have implemented this standard/feature.
One commonly cited design pattern to implement tree data is the Closure tables. Closure tables also allows for many-to-many relationships. Should we consider redesigning pepys."Tasks" using this?
If the recursion depth is going to be static, i.e. only 3 levels of depth is allowed, then can we change our table design to reflect this instead of going for a recursive structure? This will make sure that only one task id column is present and the other two levels are just their corresponding level 2 and 3 attributes.