We should drop the surrogate id key from base fields, and use the natural short_code key everywhere.

Background

Currently, the base fields table looks like id integer primary key, short_code text unique not null, created_at.... Other tables that refer to the base fields table do so via the integer id. The integer id is also exposed in several endpoints, including GET /baseFields/, POST /applicationForms/ and PUT /baseFields/:id.

Using a surrogate key (aka artificial key or synthetic key) is a trade-off. In general, some of its benefits include that it improves the performance of JOINs, and limits the size of database tables with foreign keys. However, surrogate keys are (by definition) opaque, and so they reduce legibility anywhere the surrogate ID is used: id=132 has no meaning to a reader, but short_code=proposal_budget does.

The Problem

In the case of base fields in the PDC, the surrogate ID is both opaque and unstable.

It is opaque in that using the surrogate ID makes it more difficult to read the database, as you need to do additional joins to get the short code. If the foreign key used the short code, it would become much clearer. Likewise, the few URLs we have that include the base field ID convey very little meaning in themselves - just that it will give some information about some base field - whereas a URL that included the short code would more clearly communicate its purpose.

It is unstable in that two instances of the PDC service with the "same" base field (meaning the same short-code and, presumably, label) can have different surrogate IDs, depending on the order the base fields were added.

As a result, clients need to work more-or-less exclusively with short codes. A client cares about specifying the proposal_name, not specifying base field id 103. That means that their workflow always has to include looking up the list of base fields and finding the relevant ID so that they can feed it back into the system elsewhere.

For example, the front-end needs to show a preselected list of base fields as column headers. It must list those fields by short code, since it can't guarantee what ID the API will use. Then, it has to find the relevant base field from the id-sorted list returned by the API to find the label & description. This leads to some pretty convoluted code, such as the getValueOfBaseField function or whatever this thing I wrote does.

As another example, the code for #756 will have the same concern: it cannot assume that the base fields in the target instance have the same IDs as the source instance, so it will need to act by short code.

The front-end, and probably every other client that works with the PDC, would be much simpler if it could just use short codes everywhere.

Further, the primary benefit of using surrogate keys - performance - isn't relevant at our scale, and is unlikely to be any time soon. So we have the drawbacks, but not the benefits.

Mutability

Theory

One of the arguments sometimes used in favor of surrogate keys is the concern of updating the data: what do you do if the data you need to update is (part of) the primary key?

Database-wise, that's not a problem; we can simply specify ON UPDATE CASCADE on our foreign keys, and when you update the short code, the database will update the short code in each reference.

URLs

The other concern around mutability is the changing of URLs, and in particular PUT /baseFields/:id and (soon) PUT /baseFields/:id/:lang. Currently, those endpoints are only available users with administrator access, and we can just talk to each of those few people in the uncommon case of renaming a short code. Even with that mitigation aside, I believe that legible URLs are more valuable than forever URLs (at least in this case - I'll have to reflect if I believe that more broadly!).

Practice

While I anticipate that we will occasionally need to change a short code, I suspect it will be a rare occurrence. Changing a base field that is in use - regardless of the key strategy - will change the meaning of every application and proposal that refers to it. Therefore, I expect such changes to be limited to fixing typos and so on; changes that don't affect the meaning of the field.

Refactoring

The process is slightly tedious but not at all difficult; something like:

create new FKs pointing to base_fields (short_code) alongside existings FK pointing to base_fields (id)
populate the new column
make the new column NOT NULL, if needed
drop the old FK column
set the short_code to be the new primary key
drop the id column

Broader implications

I am not suggesting that we use natural keys everywhere in the PDC. I'm not sure where else the opportunity even exists, and if and when we identify one, I'd want to take it on a case-by-case basis. I think it's a clear win with base fields, but that doesn't mean natural keys would necessarily be better elsewhere.

PhilanthropyDataCommons / service

Use natural key `short_code` as primary key for base fields #1141