PhilanthropyDataCommons / service

A project for collecting and serving public information associated with grant applications
GNU Affero General Public License v3.0
8 stars 2 forks source link

Use natural key `short_code` as primary key for base fields #1141

Open jasonaowen opened 2 months ago

jasonaowen commented 2 months ago

We should drop the surrogate id key from base fields, and use the natural short_code key everywhere.

Background

Currently, the base fields table looks like id integer primary key, short_code text unique not null, created_at.... Other tables that refer to the base fields table do so via the integer id. The integer id is also exposed in several endpoints, including GET /baseFields/, POST /applicationForms/ and PUT /baseFields/:id.

Using a surrogate key (aka artificial key or synthetic key) is a trade-off. In general, some of its benefits include that it improves the performance of JOINs, and limits the size of database tables with foreign keys. However, surrogate keys are (by definition) opaque, and so they reduce legibility anywhere the surrogate ID is used: id=132 has no meaning to a reader, but short_code=proposal_budget does.

The Problem

In the case of base fields in the PDC, the surrogate ID is both opaque and unstable.

It is opaque in that using the surrogate ID makes it more difficult to read the database, as you need to do additional joins to get the short code. If the foreign key used the short code, it would become much clearer. Likewise, the few URLs we have that include the base field ID convey very little meaning in themselves - just that it will give some information about some base field - whereas a URL that included the short code would more clearly communicate its purpose.

It is unstable in that two instances of the PDC service with the "same" base field (meaning the same short-code and, presumably, label) can have different surrogate IDs, depending on the order the base fields were added.

As a result, clients need to work more-or-less exclusively with short codes. A client cares about specifying the proposal_name, not specifying base field id 103. That means that their workflow always has to include looking up the list of base fields and finding the relevant ID so that they can feed it back into the system elsewhere.

For example, the front-end needs to show a preselected list of base fields as column headers. It must list those fields by short code, since it can't guarantee what ID the API will use. Then, it has to find the relevant base field from the id-sorted list returned by the API to find the label & description. This leads to some pretty convoluted code, such as the getValueOfBaseField function or whatever this thing I wrote does.

As another example, the code for #756 will have the same concern: it cannot assume that the base fields in the target instance have the same IDs as the source instance, so it will need to act by short code.

The front-end, and probably every other client that works with the PDC, would be much simpler if it could just use short codes everywhere.

Further, the primary benefit of using surrogate keys - performance - isn't relevant at our scale, and is unlikely to be any time soon. So we have the drawbacks, but not the benefits.

Mutability

Theory

One of the arguments sometimes used in favor of surrogate keys is the concern of updating the data: what do you do if the data you need to update is (part of) the primary key?

Database-wise, that's not a problem; we can simply specify ON UPDATE CASCADE on our foreign keys, and when you update the short code, the database will update the short code in each reference.

URLs

The other concern around mutability is the changing of URLs, and in particular PUT /baseFields/:id and (soon) PUT /baseFields/:id/:lang. Currently, those endpoints are only available users with administrator access, and we can just talk to each of those few people in the uncommon case of renaming a short code. Even with that mitigation aside, I believe that legible URLs are more valuable than forever URLs (at least in this case - I'll have to reflect if I believe that more broadly!).

Practice

While I anticipate that we will occasionally need to change a short code, I suspect it will be a rare occurrence. Changing a base field that is in use - regardless of the key strategy - will change the meaning of every application and proposal that refers to it. Therefore, I expect such changes to be limited to fixing typos and so on; changes that don't affect the meaning of the field.

Refactoring

The process is slightly tedious but not at all difficult; something like:

Broader implications

I am not suggesting that we use natural keys everywhere in the PDC. I'm not sure where else the opportunity even exists, and if and when we identify one, I'd want to take it on a case-by-case basis. I think it's a clear win with base fields, but that doesn't mean natural keys would necessarily be better elsewhere.

Further reading

slifty commented 2 months ago

I think it might be useful for us to have a real time discussion where we go over this in fuller detail; that said I'll share some of my initial thoughts:

  1. I agree that the DB impacts around query speed, storage size are not compelling reasons to keep surrogate keys. I would want to better understand the ramifications of ON UPDATE CASCADE as it relates to cost of updates (table locks, indexes, and risks of introduced inconsistent states) when there are tens of thousands of relationships that refer to those changing fields.

  2. I think there are some significant concerns with mutability that are not currently represented in the above analysis, specifically as it relates to third party clients of the PDC that will not benefit from ON UPDATE CASCADE to maintain their own internal associations with our base fields and caches of relevant PDC data. This includes research use cases and analysis of PDC data over time which would likely rely on being able to consistently compare base field data even after changes to labels and short codes, but also GMS associations with our data in the context of application form creation, and third party data provider integrations which may rely on cached references to our base fields when mapping their data to ours.

  3. I do have more abstract concerns about an API design that allows a PUT to fully invalidate the URI being PUT to (e.g. not providing a redirect).

For instance, it feels quite un-restful to have a situation where:

PUT /baseFields/foo => { body that changes shortCode to "bar" } would result in a 204 which returns an entity that is no longer associated with the original URI

and in which following up with

PUT /baseFields/foo => { body that changes shortCode to "baz" } would result in a 404.

I'm not entirely sure what the ramifications of deviating from REST best practice would be, but imagine there is probably something? (This could impact cache invalidation? entity discoverability? unexpected logical flow for clients making a series of API calls related to a given base field? something else?)

  1. Regarding the benefit of human readable URLs I will note that we could theoretically support both approaches e.g. PUT /baseFields/id/1 and PUT /baseFields/shortCode/foo