Closed mgwalker closed 5 years ago
We have 19 tables to handle the various parts of the APD, largely because of the many-to-one nature of the various APD properties (e.g., many contractors to one activity, many activities to one APD, etc.). However, we never address those various properties independently - we only fetch the entire APD and we only update the entire APD (an oversimplification, but effectively true). As a result, our data model is more complex than we actually need - and that comes with a development and maintenance cost.
Adding the ability to comment on individual sections of an APD would add more complexity. We would either need distinct comment tables for each section, or a single comments table that all of the others can reference. We would also need to update all of the section tables to include at least one reference into their associated comment table, and possibly multiple references (e.g., if we want field-level commenting, we would need one reference per field). The data relationship gets more complex. Repeat the entire process in order to be able to assign sections to users for their review.
Finally, tracking changes over time in a relational data model is often done with "history" tables - structurally equivalent to the live tables, but where "old" values are written before the "new" values are put into the live table. There's also the option of a single "changes" table where you record the old value along with what table and column it came from. Both of these changes are cumbersome, and while they are technically capturing changes over time, they do not do so in a way that makes it easy to build an APD from a past state.
All of these things are true because we're using a relational data model to describe the APD. A relational model is ideal when you have multiple types of things that are related together, and you want to be able to interact with and changes those various things independently. Because we treat the entire APD as a single unit, the relational model doesn't make sense.
The primary alternative is the document model (not to be confused with the document object model; thanks for the unique names, computer science!). In this approach, a "document" is treated as a single entity and is fetched and manipulated as a whole. Fortunately for us, PostgreSQL - the database engine we're using - has built-in document support in the form of JSON. (And anecdotally, it appears to perform as well or better than many of the databases that only support document models.)
JSON Patch (RFC 6902) is a standard format for describing changes to a JSON document. Switching to a JSON document model would enable us to use this standard for change tracking over time. We could also switch to it for saving changes from the frontend which would be a nice consistency win.
For example, we can update the name of the third contractor in the first activity:
{
"op": "update",
"path": "#/activities/0/contractor/2/name",
"value": "New Name"
}
Because a document model isn't tied to a database schema (though schemas can still be enforced in code with JSON Schema), it's much more flexible, so we can add APD metadata very easily - such as adding comments to each individal textfield.
{
"id": 3,
"name": "MP-2019-07-25-HITECH-APD",
"activities": [
{
"alternatives": "",
...
...might become...
{
"id": 3,
"name": "MP-2019-07-25-HITECH-APD",
"activities": [
{
"alternatives": {
"comments": [
{
"author": "user id",
"timestamp": "Tuesday",
"value": "A comment"
}
],
"value": ""
},
...
gives us language for highly-specific validation messages from the API Using the same JSON Pointer (RFC 6901) standard as JSON Patch, we can be very explicit about which field(s) are failing validation
{
"path": "#/activities/0/schedule/2/endDate",
"value": "End date is after the activity has finished"
}
(To be clear, switching to a document model neither gives us validation nor does it give us very-specific validation. It just provides us a data format for being specific.)
In any case, supporting comments and assignment will require work on the data model. If we choose to stick with the relational model, that will mean creating more tables, building the relationships between them, updating our models in code to describe the right relationships, creating new models in code to describe the comments (etc.), and I'm not even sure how we'd write the endpoints for adding comments (maybe something like POST /comment/apd/:apdID/activity/:activityID/contractorResource/:contractorResourceID
; it's very explicit, but quite gnarly and would put a pretty significant burden on the frontend to build proper URLs).
Switching to a document model is not a quick effort. There are a lot of moving pieces to consider. Here's my initial estimate of how the work might be sequenced:
add a data migration to add the document
column to the apds
table, and convert existing APD data into that document
update the POST /apds
and PUT /apds/:apdID
endpoints to also update the document; leave existing functionality intact
These two steps would be merged into master as a single pull request. They set the groundwork for everything that's to come after. At this stage, we have two copies of every APD stored in the database - one in the relational model and one in the document model. Everything still uses the relational model as the source of truth.
update the GET /apds
and GET /apds/:apdID
endpoints to return the APD from the document model
This would be the next pull request. We would want thorough tests before merging to make sure the frontend still works as expected, especially around being able to save changes. At this poit, the document model becomes the source of truth.
add a new PATCH /apds/:apdID
endpoint that accepts a JSON Patch object describing how to change the document; only updates the document model, not the relational model
This is setting the stage for the frontend to be able to switch over, but we want to get this in sooner rather than later so we can start testing it and make sure it works as expected. This endpoint should be disabled in production because it could cause data corruption in later steps if it's buggy here.
This endpoint needs to do some light validation of the model (not necessarily the data, though eventually that too) to ensure that arbitrary fields aren't added to the database.
update the frontend to capture changes as JSON Patches in addition to the dirty tracking we have now
This can begin at any time, including in parallel with the first step. We want it in production for a bit before switching to it exclusively, though, so we can test it out and make sure it's consistently generating the correct patch data.
enable the PATCH /apds/:apdID
endpoint in production
update the frontend to use the PATCH /apds/:apdID
endpoint instead of the PUT
endpoint
Everyone's using the document model at this point. Nearly there!
remove the PUT /apds/:apdID
endpoint
update the POST /apds
endpoint to only create the document model instead of also the relational model
add a data migration to remove the 18 additional APD tables
add a data migration to remove the extraneous id
fields in the APD sub-properties
These can all be a single pull request, or a few. In the last step, we basically clean up the document model to remove the remnants of the relational model. We cannot remove these extranous
id
s until the frontend is updating with patches - prior to that, it relies on theid
fields of the sub-properties to do partial APD updates.
recommend a tech debt spike to remove Bookshelf.js ORM and switch the remaining code models to use knex.js directly (Bookshelf.js is a layer on top of knex and adds a fair bit of complexity that we don't need if we're not using relational models)
@jeromeleecms @NAretakis I'll plan on getting some time during the working session to pitch to the team and/or answer any questions so you can decide which route you'd prefer (or if you want us to pause on this altogether for now while we settle the roadmap).
@mgwalker Thank you so much for such an excellent write-up! I think this switch might also mean a couple other things:
I wonder if we undertake this, should we also harmonize the frontend and backend document model? Right now we have serialization/apd.js
and serialization/activity.js
in the frontend to convert back and forth between the two models. The document model generated by the API is based on the relational model backing it up. E.g., the API document model has this:
{
"keyPersonnel": [{
"name": "Person's name",
"costs": [
{ "year": 2019, "cost": 123 },
{ "year": 2020, "cost": 456 }
]
}]
}
The corresponding frontend model is this:
{
"keyPersonnel": [{
"name": "Person's name",
"costs": {
"2019": 123,
"2020": 456
}
}]
}
Having the same model on both sides would make computing changes easier, and also make it much easier to ship changes from one user to another, which would be pretty slick.
Tracking some of our discussion before I forget that I ALREADY asked this:
1) Does moving to this model affect our ability to perform analytics? 2) Does this model affect our ability to separately pull / merge separate activities? 3) I'm not seeing a lot of cons in the write up -- is there something we should be aware of?
It shouldn't affect our ability to do analytics on existing data, and could end up creating new data for analytics. It might change how we to do the analytics (instead of a typical SQL query, you might need a JSON query, or a SQL query that feeds JSON into R or Matlab or something like that).
No. Currently we address a specific activity by its ID in the activities
table. Under this model, we'd address a specific activity by its JSON Pointer path. We could still copy those around between APDs. As far as merging... I'm not sure what that means, but it sounds hard either way. 😆
I think the biggest con is the risk of losing data somewhere along the way as we do the migration. Especially for pilot states that are entering their real APDs right now - we don't want them to have to do that all over. We can mitigate some of the risk by moving in small steps and doing lots of tests after each one to make sure everything is still on track.
For (2) I was thinking that merging/pulling would be similar to have a 2 APDs that laid out 10 activities each and eventually merging them into a single APD Update (pull in activities 1,3,4 from one and 5,8,9). This is not super common now, but since we are being more activity specific, I'm thinking that it may be something people want in the future -- i.e. not having to update two separate APDs when the other activities don't need anything.
Also, from a operational/oversight perspective, I might want to know: what are all the activities we've approved for State A? Which ones are currently active? etc.
Second question first: we can definitely do that!
The first one has lots of implications. I wouldn't feel comfortable saying we could or couldn't do that without a more in-depth conversation about what all that would involve. My instinct is that copying activities from 2 APDs into a single APD would probably be pretty straightforward with the document model and complex with the relational model, but there's a lot of user interaction and expectation to consider, I think.
Closing because we're going to proceed with the document model, and that's how we expect to track changes.
This is just a sub-component of #1692, so think of it more as a note-keeping issue to keep the main issue from getting cluttered.
Automatic saves
Should the app automatically save? If so, when?
Initial proposals
Tracking changes - backend
As changes are made, how should the backend determine when to create a new version? How does it track what changed between versions?
Initial proposals
Tracking changes - frontend
Haven't thought about this one much yet...