azgs / azlibrary_database

1 stars 1 forks source link

Encoding error in collection group 5 #34

Closed aazaff closed 4 years ago

aazaff commented 5 years ago

The collection_group_name for collection_group 5 reads as Troy L. PŽwŽ Environmental Geology Collection. It should read as the Troy L. Péwé Environmental Geology Collection. I am not sure what the best procedure will be to fix this in such way that it introduces no breaking changes to our metadata.

NoisyFlowers commented 4 years ago

The first thing we need to do is correct the characters in the public.collection_groups table.

-- To correct bad characters in collection_group 5
update 
    public.collection_groups 
set 
    collection_group_name = 'Troy L. Péwé Environmental Geology Collection'
where 
    collection_group_id = 5;

Next, we need to fix any collections that contain these invalid characters. Outside of the metadata in the json_data column of metadata.azgs, we only reference collection_group via fk relationships to collection_group_id in public.collections. When someone requests the collection for download, the azgs.json file is generated on-the-fly from this metadata. So, I think that the json_data column is the only place we need to fix things. I have tested the following sql locally.

-- Also correct any bad metadata that has resulted from this error
update 
    metadata.azgs
set
    json_data = jsonb_set(
        json_data,
        '{collection_group, name}',
        '"Troy L. Péwé Environmental Geology Collection"',
        false
    )
where
    json_data->'collection_group'->'name' = '"Troy L. PŽwŽ Environmental Geology Collection"'

Because this sql must be applied to all collections, I want more eyes to review it before I take it beyond my local machine. Also, if anyone is aware of anywhere else these characters might be, please let me know.

aazaff commented 4 years ago

Makes me wonder if there should be some kind of automated mechanism to re-generate the json column in case if we make any batch changes like this?

NoisyFlowers commented 4 years ago

I'm not sure what that would look like. The json column is the canonical source of most of what it contains. We could regenerate the files list. But there's nowhere else to look for the other stuff.

I really don't like mucking around with this json column en masse like this. Makes me tense.

NoisyFlowers commented 4 years ago

I went ahead and ran those sql statements on dev.

Only three records were effected. I verified each of them before and after.

NoisyFlowers commented 4 years ago

This is in ed4a01cd1ddcfc5069e058a3f001a6a8e032cb10

NoisyFlowers commented 4 years ago

This update is running on prod (as of 8/19/2020).