debrief / pepys-import

Support library for Pepys maritime data analysis environment
https://pepys-import.readthedocs.io/
Apache License 2.0
5 stars 5 forks source link

pepys."Tasks" table - Self-Referential design #839

Closed rnllv closed 3 years ago

rnllv commented 3 years ago

pepys."Tasks" table has a self-referential design (popularly known as Adjacency list anti-pattern in SQL Designs). Self-referencing table design is considered as a design anti-pattern for the below reasons.

1) For tree like data structure represented by Adjacency list, the operations on the nodes cannot be done independent of the edges. In other words, two logically different functions (DML on Task definitions, and DML on Task relations) rely on the same table.

2) Join operations for tree querying has to process the Task definition each time.

3) Cascading deletes cannot be specified as they might induce a recursive sequence of deletes.

4) Depth of tree has to be hard coded into the query. This issue is invalidated by the SQL-99 standard, which defines recursive query syntax using the WITH keyword followed by a common table expression. And currently, Sqlite and Postgres have implemented this standard/feature.

One commonly cited design pattern to implement tree data is the Closure tables. Closure tables also allows for many-to-many relationships. Should we consider redesigning pepys."Tasks" using this?

If the recursion depth is going to be static, i.e. only 3 levels of depth is allowed, then can we change our table design to reflect this instead of going for a recursive structure? This will make sure that only one task id column is present and the other two levels are just their corresponding level 2 and 3 attributes.

robintw commented 3 years ago

Thanks @arunlal-v. Interesting question - and now is definitely the right time to consider it (before the users have a load of Task entries). I'm not 100% sure that switching how we store Tasks is the right thing to do, but it has also raised a number of interesting relevant questions in my mind, so it's definitely worth thinking about. I've had a few thoughts which I'll try and jot down sometime - but this week I'm involved in a conference full-time, so won't have much time to work on this. I'll be able to provide a full response next week.

IanMayo commented 3 years ago

The article was an interesting read @arunlal-v

It's important that @Robin & I understand the idea. I think the change would extend to Platforms and Participants.

We have these concepts for Tasks:

It may be a good idea to create a SQLFiddle-playground demo table, showing the new structure, some sample data, plus how we would run queries against it.

rnllv commented 3 years ago

Here's a first demo: http://sqlfiddle.com/#!17/3d6e9/5

Let me know if any other scenarios are to be included.

P.S.: The queries demonstrated in the fiddle can be achieved, even though a little cluttered, using Adjacency list table design as well.

rnllv commented 3 years ago

Sure @robintw.

To illustrate point 3 mentioned the original post, consider the below Tasks table with Adjacency design and "ON DELETE CASCADE" specified on the foreign key.

postgres=# select * from tasks;
 task_id | parent_task_id | name  |              start               
---------+----------------+-------+----------------------------------
       1 |                | Task1 | 2021-03-30 23:24:33.990656+05:30
       2 |              1 | Task2 | 2021-03-30 23:24:34.01517+05:30
       3 |              2 | Task3 | 2021-03-30 23:24:34.039694+05:30
       4 |                | Task4 | 2021-03-25 23:24:34.064555+05:30
       5 |              4 | Task5 | 2021-03-25 23:24:34.089038+05:30
       6 |              5 | Task6 | 2021-03-25 23:24:34.113507+05:30
(6 rows)

postgres=# delete from tasks where task_id=1;
DELETE 1
postgres=# select * from tasks;
 task_id | parent_task_id | name  |              start               
---------+----------------+-------+----------------------------------
       4 |                | Task4 | 2021-03-25 23:24:34.064555+05:30
       5 |              4 | Task5 | 2021-03-25 23:24:34.089038+05:30
       6 |              5 | Task6 | 2021-03-25 23:24:34.113507+05:30
(3 rows)

As can be seen, the "ON DELETE CASCADE" action defined on adjacency list Tasks table deletes task 2 and task 3 definitions. This means, we have to give up either the ability to hold task definitions without it being in any relation ( and preventing us to reuse tasks in another hierarchy) or the ability to use "ON DELETE CASCADE" action that helps clean up orphaned records automatically.

A Tasks table with Closure table will lose only the specific task definition and the Task_Relationship table will be cleared of any records referencing the deleted task in ancestor or descendent columns.

rnllv commented 3 years ago

Based on analysis of data in pepys.Tasks table, the details shared in issue 840 and in this thread, and my discussion with @IanMayo, I think we can remodel the Tasks and its related tables without recursion.

Consider this design: http://sqlfiddle.com/#!17/876ed3/1

I believe this closely resembles real life war game series. The Tasks table in the new design can be enhanced to include iteration details of the war game series (or the iteration can be arrived upon based on calculated rank after sorting all tasks of a given task_group by year), but those are finer points that we can consider later.

This design can also handle ad-hoc war games that do not belong to any series, like a joint US-Japan war game in response to an aggressor in the South China Sea.

This design eliminates

One advantage of this design is that task_group.name and task.name can be directly used as Level 1 and Level 2 filters respectively from UI.

rnllv commented 3 years ago

@IanMayo, @robintw, please let me know if there are any scenarios that cannot be handled by the new design.

IanMayo commented 3 years ago

Thanks @arunlal-v, couple of items of feedback: 1) We can't use the term task_group, since it already has meaning in my user domain https://en.wikipedia.org/wiki/Task_Group 2) I'm sure the users need the two levels of tasks. Let me mock structure:

As you'll see, we can't put all of those J05090 entries under Joint Warrior. The indexes are only unique within that particular wargame (JW 20/01).

Let me introduce a naming convention for the tiers:

  1. Series
  2. Wargame
  3. Serial

The presence of JW 20/01 also helps create/maintain the lower level. The analysts receive a document listing all the ships (participants) in a wargame. They add it to the wargame as participants. Then as they are adding the serials, they only have to select participants from the limited list defined for the wargame.

@arunlal-v - I'd appreciate you updating the SQLFiddle to reflect the above data model.

rnllv commented 3 years ago

Considering the review comments, this is how the Tasks module should be designed. This is a long form version of the design. Once this is finalised, we can move into ER Diagrams and SQLFiddles.

"Exercises" should be the central point. (In our example, JW 20/01 is an exercise) An exercise, in Debrief parlance, is comprised of a set of "Serials" (J05090, J05110, etc. are examples of serials) An exercise has a non empty set of "Participants" (ex. Ships, Aircrafts etc.) A Serial in a given exercise involves "Participation" from only the set of participants mapped for that exercise. An exercise may be part of a "Recurring Series" (In DB terms, it will be represented by "Series" and "Series_Exercises" tables)

The following are the advantages of using a normalized structure, like above, instead of a self-referential table with a generic structure to capture disparate attributes.

1) The front end (having different input forms and filters to capture information about different modules like Exercises, Serials ) has to maintain and process the recursion logic each time it has to store any entity data to the table, as each entity is in the same table with different levels of recursion. In case of the normalised design, the front end just has to update specific table to save specific information without any additional logic

2) It helps to extend the application with new functionality, if required. A new feature like Track the next Serials can be based off of just the Exercises/Serials entity (in the new design) and a new events table, if required. This implementation will be complex if the self-referental design of Tasks is used as it has join the entire Tasks table containing information about exercises, Serials, and series. Another new feature like tracking Ground Rules (ex. Warships can only use 'n' number of torpedo simulations, etc.) can be introduced just by creating new tables and referencing them in either Participant/Participation entities (in the new design), instead of referencing from the entire tasks table.

IanMayo commented 3 years ago

Thanks @arunlal-v - have renamed games to serials.

rnllv commented 3 years ago

Here's the SQLFiddle: http://sqlfiddle.com/#!17/9dcc4/5

rnllv commented 3 years ago

Changed Series_Events to Series_Exercises

rnllv commented 3 years ago

Updated SQLFiddle with name of Participants and Participations changed to Exercise_Participants and Serial_Participants respectively.

http://sqlfiddle.com/#!17/89161/4

rnllv commented 3 years ago

Please find below updated long form factoring in the name changes

"Exercises" are the central point in Debrief to analyze war games. An exercise, in Debrief parlance, is a logical collection of games or "Serials". An exercise has a set of participants called "Exercise Participants" (ex. Platforms like Ships, Aircrafts etc.) A serial in a given exercise has a set of participants called "Serial Participants" which is always a subset of exercise participants mapped to the given exercise. An exercise may be part of "Series Exercises" which is a logical set of exercises tracked under a "Series".

rnllv commented 3 years ago

@IanMayo, as per the above, a platform that takes part in a serial will have more than one entry in "Serial Participants" for a given serial (i.e. more than one serial_participant_id for the same exercise_participant_id, serial_id combo) based on the duration during which it was active. "Serial Participants" also has start and end attributes which is not the case with "Exercise Participants". So should we rename "Serial Participants" to "Serial Participation"?

rnllv commented 3 years ago

This is the ER diagram for the new structure. (Privacy fields and other metadata attributes are to be added)

ERDiagram

robintw commented 3 years ago

I'm busy with childcare today, and won't be able to look at this properly until Tuesday (Monday is a public holiday in the UK). However, in general I like the new approach. I've got a couple of questions based on the ER diagram though:

  1. Why does the series_exercises table exist? Can't we just have a series_id field in the exercises table? That would remove one extra layer of tables, and make it easier for me to query the exercises that are part of each series (particularly as doing queries with lots of joins can get complicated with SQLAlchemy). I thought these sort of 'intermediate tables' were only needed when there was a many-to-many relationship, but (as I understand it) each series can have multiple exercises, but each exercise can only be in one series.

(I think I've got another question about exercise_participants, but I need to have a bit more of a think about that).

rnllv commented 3 years ago

No issues @robintw. Enjoy your long weekend :)

  1. It's because there could be exercises that are not part of any series like PASSEX 20/06 and ADEX 20/01. The new design makes exercises independent of any series. Based on whether the exercises are part of a series, these two tables (series and series_exercises) can be joined to the query or ignored. The relationships can be restricted using constraints. each series can have multiple exercises is handled in the design. each exercise can only be in one series can be enforced by a unique key constraint on series_exercises.exercise_id. But your suggestion: Can't we just have a series_id field in the exercises table? is also valid. We'll have one less table to join, but there will be null values for exercise.series_id for exercises that are not part of any series. We can discuss further when you're back.
robintw commented 3 years ago

Thanks @arunlal-v.

In response to 1) - yes I think that it would be better to remove the series_exercises table and just have an exercise.series_id field which can be null if the exercise isn't associated with a series. That will work best with the way that I'm dealing with database objects in the UI.

My next question is about why we need both serial_participants and exercise_participants, and why serial_participants references exercise_participants. Could we not just have a participants table (like we do currently) and reference that from both exercises and serials? It would mean we just had a table of participants where the ID was either a serial ID or an exercise ID, and then it referenced the platform etc.

I also don't quite understand what you said above:

a platform that takes part in a serial will have more than one entry in "Serial Participants" for a given serial (i.e. more than one serial_participant_id for the same exercise_participant_id, serial_id combo) based on the duration during which it was active.

Why would there be more than one entry in Serial Participants? Is that just if it is active for various distinct time periods? (eg. 09:00-11:00 and then 15:00-18:00) - I don't know whether this actually happens - @IanMayo, is this something we need to support?

Just as a final comment: it might be useful to extend the SQLFiddle a bit to include some more entries in the data so we can see more easily how this works for various more complex situations. For example, adding an exercise that has no series, adding multiple platforms assigned to different exercises/serials, with different durations etc.

rnllv commented 3 years ago

In response to 1) - yes I think that it would be better to remove the series_exercises table and just have an exercise.series_id field which can be null if the exercise isn't associated with a series. That will work best with the way that I'm dealing with database objects in the UI.

Sure @robintw, we can remove series_exercises table and refer series_id in exercises table. @IanMayo, do you have any comments here?

My next question is about why we need both serial_participants and exercise_participants

They were originally named "*Exercise Participants" and "Serial Participa*tion". The latter was renamed as requested by @IanMayo. I was checking with him if we would still want to go ahead with this change. Referring to the long form design, the "Exercise Participants" contains the set of platforms earmarked for this exercise. When analysts need to mark any platform against a serial, the platform dropdown element (or UI filter) will be populated from this set instead of displaying the entire platforms defined in pepys."Platforms" table. This approach also handles scenario where a platform couldn't take part in any of the serials (due to technical or other issues), but our reports can pull out all earmarked platforms for the given exercise.

Why would there be more than one entry in Serial Participants? Is that just if it is active for various distinct time periods? (eg. 09:00-11:00 and then 15:00-18:00) - I don't know whether this actually happens - @IanMayo, is this something we need to support?

Could see from point 1 in this comment that this might be possible. Anyways, the design can handle multiple participation in a single serial. I was just informing @IanMayo that renaming "Serial Participation" to "Serial Participants" would be confusing for so and so reasons.

Just as a final comment: it might be useful to extend the SQLFiddle a bit to include some more entries in the data so we can see more easily how this works for various more complex situations. For example, adding an exercise that has no series, adding multiple platforms assigned to different exercises/serials, with different durations etc.

Sure thing @robintw, I'll get started on that once the design is a little stable.

IanMayo commented 3 years ago

Hello @robintw & @arunlal-v

Platform participates in a serial multiple times.

I checked with the users. No, this doesn't happen.

Two sets of participants (participations)

I think the root cause to needing two tables of participation is that we've separated the 3 tiers of Tasks into Series, Exercises and Serials. Once we separate Exercises and Serials we need two different tables to record participations in each one. There are merits to separating the two tables, since more detail is captured at the Serial level.

A platform first gets assigned to an Exercise (Exercise_Participants) and then to a child serial (Serial_Participants). The presence of a Exercise_Participant in a Serial_Participant is how we ensure a serial participant is already present in the exercise. We currently enforce this through Python logic, but the above ensure the logic is enforced in the database, too.

Exercises not part of a Series

I'm happy for the users to an Other Series, into which ad-hoc exercises are stored. If it becomes unweildy they have the tools to separate Other down further - either by year or exercise type. That's their core housekeeping responsibility.

So, we do not need an Exercise Series table.

The above logic also keeps @robintw 's task table in the UI logical - since every exercise sits within a Series.

rnllv commented 3 years ago

Thanks @IanMayo. I'll share the updated design in sometime.

rnllv commented 3 years ago

Here's the SQLFiddle with updated design

Here's the ER Diagram.

image

@IanMayo, do we need privacy_id field added for all these new tables?

IanMayo commented 3 years ago

Privacy Fields.

Yes, we'll need Privacy for all of them, please.

@robintw - in the UI, by default we can use the lowest level value in the Privacies table, and not force the user to select a value.

robintw commented 3 years ago

@IanMayo Do we definitely need a privacy for all of those tables? Or just for Exercises, Series and Serials? Given that Platforms have their own privacy - can the participants tables do without a privacy?

IanMayo commented 3 years ago

Hello @robintw - I welcome you testing the logic. In this instance there may be a justification.

Let's imagine HMS Queen Elizabeth (QE).

image

The fact she exists is public knowledge. The fact she is taking part in JW 21/01 is sensitive until it is published by the MOD. So, that means the "Exercise participant" value for QE/JW21/01 is "Private" until the date of the news release when it becomes "Public". Some of the Serials in JW21/01 may be public, since the MoD may have had to send out messages to fishing craft / aircraft about if there is going to be some GPS Jamming (for instance). But, others won't be shared, nor will the details of who is taking part in which serials.

The above is quite a bit of juggling, but it's quite familiar to the analysts.

robintw commented 3 years ago

Your example makes sense - lets go forward with privacies for everything.

rnllv commented 3 years ago

Okay @robintw, @IanMayo.

Any other fields to be added?

IanMayo commented 3 years ago

Here is the schema: image

We need environment and location for Serials, we need Force for Serial_Participants

rnllv commented 3 years ago

Sure @IanMayo. Will include these fields. Also, the new default data state in pepys."PlatformTypes" and frequency in pepys."Series" are interval datatypes. We don't have that in SQLite.

IanMayo commented 3 years ago

How about DefaultDataIntervalSecs as an integer?

rnllv commented 3 years ago

Got it.

robintw commented 3 years ago

Yes, I think just handle it as seconds in an integer field - but most other field names we have are lower-case with dashes separating words, so maybe default_data_interval or default_data_interval_secs?

rnllv commented 3 years ago

default_data_interval_secs will be descriptive and preferred SQL way.

rnllv commented 3 years ago

To be consistent with the other tables, I'll be going with camelCase naming for table names.

rnllv commented 3 years ago

@IanMayo, what should be the datatype for start and end. Should we retain it as "Timestamp without timezone" or follow Postgres recommendation?

IanMayo commented 3 years ago

While data may come from different timezones, all systems capturing maritime data run at UTC, which effectively makes the TimeZone of no value.

There may be an argument in the future for changing to allow TimeZone, but this issue isn't a sufficient reason to move away from what's followed in the rest of Pepys database.

rnllv commented 3 years ago

Fine

rnllv commented 3 years ago

Here's the updated sqlFiddle [Privacy field missed. Updating in sometime]

Here's the ER diagram.

image

IanMayo commented 3 years ago

Sorry I didn't mention this earlier @arunlal-v - we don't need fruqency_days in Series. It doesn't bring any value to the users.

rnllv commented 3 years ago

No issues @IanMayo

Here's the updated SqlFiddle

And here's the ER Diagram.

image

rnllv commented 3 years ago

@IanMayo, I'm guessing the answer for my following question is "No". But just wanted to get it confirmed by you.

There will be only one serial happening at any given time for a exercise, right?

rnllv commented 3 years ago

And should we consider multiple serials in a day?

IanMayo commented 3 years ago

There will typically be up to 5 serials per day.

The serials can overlap, though a participant can only be in one serial at a time.

While the "big players" may all do a fast-paced one-hour serial where the aircraft carrier gets attacked by aircraft, some other units may be doing a 6 hour serial that involves a slower paced undertaking such as a commando reconnaissance patrol.

robintw commented 3 years ago

@IanMayo You said:

I'm happy for the users to an Other Series, into which ad-hoc exercises are stored. If it becomes unweildy they have the tools to separate Other down further - either by year or exercise type. That's their core housekeeping responsibility.

I agree, that will be very helpful for me from the UI perspective. Should we create this Other series as part of the default metadata that can be added when the database is created? Then we can always assume it exists, and add things to it if necessary.

IanMayo commented 3 years ago

I don't think we need to create the Other series. The users can create their own top-level series - and add wargames to it.

and add things to it if necessary.

I can only picture Pepys adding wargames to a series by a user first selecting the series (maybe after creating it), and then adding the wargame.

rnllv commented 3 years ago

@IanMayo, I believe we'll be needing one or more of the following fields in pepys."Serials". Kindly confirm.

base_identifier text - to capture J in J05324 day_of_month integer - to capture 05 in J05324 serial_index integer - to capture 324 in J05324

In the dashboard picture shared in issue 840, I see the serial name as CASEX 324, where as in the examples you have shared as comments in other replies in the same issue has J05324.

rnllv commented 3 years ago

@IanMayo, with reference to this comment, what serial name should the dashboard display if there are multiple serials for the previous day?

IanMayo commented 3 years ago

@IanMayo, I believe we'll be needing one or more of the following fields in pepys."Serials". Kindly confirm.

No, these serial values are human generated. We don't need to produce or interpret them.

There are two names for a Serial. The J06230 is a human-created value that is used to uniquely refer to each part of the wargame. The CASEX 324 is the name of a pre-scripted exercise contained in a book held by the participants. The CASEX 324 may occur several times across the wargame. So, it's useful for the participant to know what they're going to be doing, but doesn't uniquely identify a serial/segment/component.

That is why both are needed, and why analysts find the CASEX 324 field useful in addition to the exercise.

Here is where @robintw added the Exercise field: https://github.com/debrief/pepys-import/issues/836

Based on the above, I think we need @arunlal-v to modify the table design:

  1. Rename exercises to wargames
  2. Rename serial.name to serial.serial_number and introduce an optional exercise text field.
IanMayo commented 3 years ago

@IanMayo, with reference to this comment, what serial name should the dashboard display if there are multiple serials for the previous day?

The Serial (J06230) will always be displayed - since they are unique for this wargame. But, analysts will find value in the exercise label too (CASEX 324) - so we'll display both.

IanMayo commented 3 years ago

Aah, @arunlal-v - I realise we need another field for SerialParticipants. The participant will be playing as either Blue Force (friendly) or Red Force (enemy). One day we may need to support White Force (non-combatant).

I think this could be met through a Force Types table, containing:

This will allow a compulsory Force field in the SerialParticipants table.