The most obvious question here is why we are storing duplicated data in the first place. The answer to that is lost in the mists of time. I think we expected an azgs.json file to accompany each collection when it was first uploaded, which we could then use to properly create the collection in the database.
At some point, we decided to store that complete metadata in the db as well. Then, in the api, we decided to bundle it with every collection that is downloaded and to provide it via a metadata/collectionID endpoint. These actions aren't necessarily wrong, except that the metadata is then sent back through the api if the collection is edited and used to update the collection in the db. If the values in that metadata were not kept up to date, the stale values get propagated back into other tables in the db when the update occurs.
Here is one proposal: remove all duplicate values from the json, and regenerate them when needed. Code review suggests there would be two parts to this:
1) In azlibAdd, there is the following upsert:
const insertSQL = `insert into public.collections (
    private,
    formal_name,
    informal_name,
    azgs_old_url,
    ua_library,
    collection_group_id,
    perm_id,
    supersedes)
values ($1, $2, $3, $4, $5, $6, $7, $8)
on conflict (perm_id) do update set
    private = $1,
    formal_name = $2,
    informal_name = $3,
    azgs_old_url = $4,
    ua_library = $5,
    collection_group_id = $6
returning collection_id, perm_id, (xmax=0) as inserted`;
The values for all these fields come from the json. After this executes, remove all of those values from the json before it is stored in the db.
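As a rough sketch, assuming metadata.azgs.json_data is a jsonb column and the keys in azgs.json mirror the column names above, the stripping could even be pushed down into the insert itself (the exact insert into metadata.azgs is hypothetical; the same thing could just as easily be done on the parsed object in azlibAdd before it is stored):

insert into metadata.azgs (collection_id, json_data)
values (
    $1,
    -- hypothetical: drop the keys that duplicate public.collections columns;
    -- key names are assumed to match the column names
    $2::jsonb
        - 'private'
        - 'formal_name'
        - 'informal_name'
        - 'azgs_old_url'
        - 'ua_library'
        - 'collection_group_id'
        - 'perm_id'
        - 'supersedes'
);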
2) In the api, pull these values from public.collections and add them to the json before it is sent to the client. In collections.model.js, this would happen where the following sql executes:
return db.one(`
    select
        json_data
    from
        metadata.azgs
    where
        collection_id = (
            select
                collection_id
            from
                public.collections
            where perm_id = $1
        )
`, [collectionID]);
In metadata.model.js, this would be anywhere the json_data is queried.
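For collections.model.js, a minimal sketch of what the rebuild could look like, assuming json_data is jsonb and the keys match the column names (the `||` merge restores the same set of keys stripped at insert time):

select
    m.json_data || jsonb_build_object(
        'private',             c.private,
        'formal_name',         c.formal_name,
        'informal_name',       c.informal_name,
        'azgs_old_url',        c.azgs_old_url,
        'ua_library',          c.ua_library,
        'collection_group_id', c.collection_group_id,
        'perm_id',             c.perm_id,
        'supersedes',          c.supersedes
    ) as json_data
from
    metadata.azgs m
    join public.collections c on c.collection_id = m.collection_id
where
    c.perm_id = $1;

Because `||` lets the right-hand side win, any stale copies that did remain in json_data would also be overwritten by the current column values.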
Another option would be to keep the duplicate values and use triggers on updates to public.collections that set the corresponding values in metadata.azgs.json_data.
This might be easier, contextually, as it keeps the metadata always up to date and ready to go, without needing further query shenanigans.
The key to this approach is to ensure public.collections is the authoritative source of these fields. From what I can tell, setting triggers on values within jsonb is either not possible or so harebrained that nobody is talking about it. So the sync direction must be public.collections to metadata.azgs.
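A minimal sketch of such a trigger, with made-up names and under the same assumption that the json keys match the column names (this is not existing code):

-- hypothetical trigger: push column changes into the stored json
create or replace function public.sync_metadata_json()
returns trigger as $$
begin
    update metadata.azgs
    set json_data = json_data || jsonb_build_object(
        'private',             new.private,
        'formal_name',         new.formal_name,
        'informal_name',       new.informal_name,
        'azgs_old_url',        new.azgs_old_url,
        'ua_library',          new.ua_library,
        'collection_group_id', new.collection_group_id
    )
    where collection_id = new.collection_id;
    return new;
end;
$$ language plpgsql;

create trigger sync_metadata_json
    after update on public.collections
    for each row
    execute function public.sync_metadata_json();  -- "execute procedure" on Postgres < 11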
The only times json metadata values would be authoritative are when a new collection is uploaded or a collection is modified through the api. In the first case, the collection will contain an azgs.json file we use to initialize fields in public.collections. In the second, we use the azgs json schema to pass updated values to the api server. In both cases, we already sync the values in public.collections to the json before storing it, so there's nothing more to do.
I prefer the first option of just getting rid of the duplicates in the json, but I want to be really sure that past us didn't have reasons for keeping them other than the bulk upload. Are they possibly used in some of the search functionality?
After working on the new lineage stuff, it's clear to me that we intended metadata.azgs.json_data to be the authoritative source rather than public.collections. We already have a trigger in place that updates elements of public.collections whenever the json changes. Rather than rock the boat, I decided to stay with that approach for the lineage changes. So I modified the trigger to update the lineage table based on the json data.
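For reference, the general shape of a trigger in that direction (json_data driving public.collections) would be something like the following. This is only an illustration, not the actual trigger; the casts are guesses at the column types, and the lineage-table maintenance mentioned above is omitted:

-- illustrative only: sync selected keys of the json back into public.collections
create or replace function metadata.sync_collections_from_json()
returns trigger as $$
begin
    update public.collections
    set formal_name         = new.json_data ->> 'formal_name',
        informal_name       = new.json_data ->> 'informal_name',
        azgs_old_url        = new.json_data ->> 'azgs_old_url',
        private             = (new.json_data ->> 'private')::boolean,         -- assumed type
        ua_library          = (new.json_data ->> 'ua_library')::boolean,      -- assumed type
        collection_group_id = (new.json_data ->> 'collection_group_id')::int  -- assumed type
    where collection_id = new.collection_id;
    -- ...plus whatever the modified trigger does to keep the lineage table current
    return new;
end;
$$ language plpgsql;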
My opinion is that there is not a strong enough reason at this time to warrant any of the changes described above. A safer approach is to just make sure that trigger is doing its job. I think it currently is.
I am fine with this conclusion then.
Currently, azgs.json is stored in metadata.azgs.json_data. It contains several fields that are also stored elsewhere in the database. We have seen problems when these fields become out of sync. Investigate ways of handling these sync issues.