FreeUKGen / FreeBMD2

For everything related to FreeBMD2. An updated version of the original FreeBMD genealogy website.

Decide on the form of FreeBMD2 urls #385

Closed PatReynolds closed 1 year ago

PatReynolds commented 3 years ago

Was part of #87

After the redirect of FreeBMD1 citation URLs to FreeBMD2 (story #384) is complete, decide whether the FreeBMD2 URLs used in the citation generator will be:

The FreeBMD1 version of the URL (which will direct to FreeBMD2 after www / beta is changed to freebmd1/www), OR (preferably) the FreeBMD2 prettified URL

richardofsussex commented 2 years ago

See #276. I'm suggesting that FreeBMD URLs should have the canonical form:

https://freebmd.org.uk/search_records/MDLSyeQhqIQQVpPJz7ZNlw/

I'm happy for us also to generate a variant URL with human-friendly details on the end, so long as both variants map to the same page/record. Having invented said friendly variant, I can't see any reason not to use it in the citation generator.

richardofsussex commented 2 years ago

Having read about the 'resource' concept in Rails, I'm wondering if we can do better than this. My proposal is now:

https://freebmd.org.uk/gro/[event_type]/[uuid] e.g. https://freebmd.org.uk/gro/birth/MDLSyeQhqIQQVpPJz7ZNlw/

The reason for including '/gro' in the path is that we may acquire BMD data from other sources in future. Declaring a Rails resource gives you the ability to create pages to create, edit and delete instances of that resource type. Non-GRO BMD records may have a different data structure/requirements.

URLs of this form can be redirected to e.g. https://freebmd.org.uk/search_records/MDLSyeQhqIQQVpPJz7ZNlw/ since the UUID is unique within the FreeBMD database.

We could apply the same approach to REG and CEN, since the more meaningful URLs could be redirected to the currently supported 'search_records' path.
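As a rough illustration of the redirect described above, here is a minimal plain-Ruby sketch (not Rails routing code) that maps a path of the proposed form onto the existing 'search_records' path. The helper name `search_records_path_for`, the `EVENT_TYPES` list and the pattern used for the identifier are assumptions made for this example, not anything taken from the FreeBMD2 codebase.

```ruby
# Assumed list of GRO event types; only "birth" appears in the example URL above.
EVENT_TYPES = %w[birth marriage death].freeze

# Matches /gro/<event_type>/<id> with an optional trailing slash.
# The character class for the id is an assumption based on the sample hash.
GRO_PATH = %r{\A/gro/(?<event_type>[a-z]+)/(?<id>[A-Za-z0-9_-]+)/?\z}

# Hypothetical helper: returns the 'search_records' path to redirect to,
# or nil when the path is not of the proposed /gro/... form.
def search_records_path_for(path)
  match = GRO_PATH.match(path)
  return nil unless match && EVENT_TYPES.include?(match[:event_type])

  "/search_records/#{match[:id]}/"
end

puts search_records_path_for("/gro/birth/MDLSyeQhqIQQVpPJz7ZNlw/")
```

Because the identifier alone determines the target, the event type is only validated here and then dropped, which matches the point that the UUID is unique within the database.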

PatReynolds commented 2 years ago

Also discussed via email: the FreeBMD1 hashes are only quasi-permanent, because they become unusable as we enhance records.

Therefore a truly permanent unique reference is needed, plus a referral from known FreeBMD1 hashes to that URL.

richardofsussex commented 2 years ago

There are two issues here, one of which probably ought to be in a separate thread. We can decide on the preferred form of URLs in an abstract way, e.g. https://freebmd.org.uk/gro/birth/:id/ without deciding exactly how :id is obtained. Then, as you say, we have the task of deciding how to mint unique persistent identifiers to be the :id in the URL.

It would be helpful to have a sense of what proportion of BMD1 hashes become invalid over time. It's a problem we already have: the URLs which appear on the BMD1 site can be quoted as sources (I routinely use them on WikiTree), and any of them may become invalid as the result of any update. So, arguably, by adopting the BMD hashes as our :id, we aren't making the situation any worse than it currently is. In fact, if we were to migrate from SQL to MongoDB, and include the hashes which were generated by the 'last' BMD1 update as primary keys, we wouldn't have the problem at all. (Or, at least, we wouldn't make it any worse.) However, this makes the big assumption that we carry out all subsequent updates on the MongoDB data, using an update system which has yet to be written.

richardofsussex commented 2 years ago

Migrating from FreeBMD1 hashes to permanent UUIDs is a separate issue from the form of the URLs we generate. I would expect us at some point to migrate from MySQL to MongoDB, and 'freeze' the hashes in that version of the database as the permanent identifiers.

richardofsussex commented 1 year ago

Having implemented the redirection of BMD1 URLs to BMD2 equivalents, I now think that we should develop a system for URLs which accepts the reality that we do not have permanent identifiers for entries. What we do have are record numbers, which are specific to a particular iteration of the BMD database, and hashes, which remain persistent while the underlying data stays the same. So, of these two types of identifier, the hash is the obvious one to use.

My proposed strategy for dealing with hashes which become invalid is to reconstruct the search which led to the record in question. This can only be achieved by knowing the search criteria. In other words, the 'prettified' URL becomes an essential part of our strategy for dealing with non-permanent identifiers. Logically, this means that we should use the 'prettified' form of the URL in the citation generator.

Another strand to this strategy is to have a lookup table mapping invalid hashes to their replacement values, and to use this lookup silently whenever a hash lookup fails before reporting an error.
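A minimal sketch of that lookup, assuming both the current records and the replacement table can be represented as simple hash maps. The name `resolve_entry`, its parameters and the `max_hops` limit are hypothetical and chosen for illustration; a real implementation would query the database instead.

```ruby
# records:      current entry hash => record (stand-in for a database lookup)
# replacements: invalid entry hash => the hash that superseded it
# max_hops:     assumed safety limit, so a cycle in the table cannot loop forever
def resolve_entry(hash, records, replacements, max_hops: 10)
  current = hash
  (max_hops + 1).times do
    # Direct hit: the hash (or one of its replacements) is still valid.
    return records[current] if records.key?(current)

    # Otherwise follow the replacement table silently, if it has an entry.
    current = replacements[current]
    return nil if current.nil?
  end
  nil # gave up after max_hops replacements
end

records = { "hashC" => { surname: "Smith" } }
replacements = { "hashA" => "hashB", "hashB" => "hashC" }
p resolve_entry("hashA", records, replacements)
```

Following the table more than one step covers the case where an entry has been updated several times; only when this returns nil would the error (or any other fallback) be shown to the user.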

richardofsussex commented 1 year ago

The "reconstruct the search from the prettified URL" strategy will only work if the search elements can be reliably extracted from the prettified URL. We need to ensure that the hyphen-separated format is amenable to automated parsing. For example, in the FreeREG example, a date appears as "yyyy-mm-dd" within the URL, implying three items of information, not one.

PatReynoldsFUG commented 1 year ago

Sorry, Richard, I hadn't gathered that you were planning for a machine to reconstruct the search! The current form was designed for a human to reconstruct the search. And also to be interpreted by a search engine (so "John" "Smith" "Scarborough" "birth" "1832" would need to be recognisable) - Ben was the developer involved.

richardofsussex commented 1 year ago

Well, I'm just thinking through the possibilities. The current approach of outputting a hyphen-separated sequence of fields is fine for my purpose, so long as (a) the order of fields is known and consistent for each project and (b) hyphens are only used as separators, i.e. hyphens in the data are converted to something else, e.g. underscore.
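To make conditions (a) and (b) concrete, here is a small sketch of encoding and decoding such a URL segment. The field order in `FIELD_ORDER` and the helper names are assumptions for the sake of the example and are not the actual FreeBMD2 format; the only points being illustrated are a fixed order and hyphens reserved as separators.

```ruby
# Assumed, fixed field order for one project (condition a).
FIELD_ORDER = %i[forename surname district event_type year].freeze

# Build the hyphen-separated segment. Hyphens and whitespace inside a value
# are converted to underscores so that "-" only ever separates fields (condition b).
def encode_fields(fields)
  FIELD_ORDER.map do |name|
    fields.fetch(name, "").to_s.strip.downcase.gsub(/[-\s]+/, "_")
  end.join("-")
end

# Split the segment back into named fields. Returns nil when the number of
# parts does not match the expected field order.
def decode_fields(segment)
  parts = segment.split("-", -1) # -1 keeps empty trailing fields
  return nil unless parts.length == FIELD_ORDER.length

  FIELD_ORDER.zip(parts).to_h
end

segment = encode_fields(forename: "Mary-Ann", surname: "Smith",
                        district: "Ashby-de-la-Zouch", event_type: "birth", year: 1832)
puts segment
p decode_fields(segment)
```

Note that the conversion is lossy: after decoding, an underscore could stand for either a hyphen or a space in the original data, so the decoded values are suitable for pre-filling a search rather than for exact matching.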

What I have in mind is that when a hash value fails to resolve to an active record, the system will return the user to the search form, and will fill in the elements of the search from the prettified elements in the URL. It will also put up a flash message explaining that the hash is no longer valid, and inviting them to tweak the search and re-run it. The gotcha is that, since the data has (by definition) changed, that search as it stands is unlikely to succeed. So the user will have to relax/change it until it does find the record. So it's a team effort between the software and the user, but at least it saves them the irritation of having to enter all the search criteria by hand.
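The flow described above could look something like the following plain-Ruby sketch. It is not controller code from the project: `handle_citation_lookup`, the `LookupResult` struct and the wording of the flash message are all hypothetical, and the records are again represented by a simple hash map.

```ruby
# Hypothetical result object: what to render, plus any data needed to render it.
LookupResult = Struct.new(:action, :record, :prefill, :flash, keyword_init: true)

# hash:              entry hash taken from the URL
# prettified_fields: search fields already parsed from the prettified part of the URL
# records:           current entry hash => record (stand-in for a database lookup)
def handle_citation_lookup(hash, prettified_fields, records)
  record = records[hash]
  return LookupResult.new(action: :show_record, record: record) if record

  # The hash no longer resolves: send the user back to the search form with
  # the criteria filled in, and explain why via a flash message.
  LookupResult.new(
    action: :show_search_form,
    prefill: prettified_fields,
    flash: "This link no longer matches a current entry. " \
           "The search below has been filled in from the link; " \
           "please adjust it and run it again."
  )
end

result = handle_citation_lookup("stale", { surname: "smith", year: "1832" }, {})
puts result.action
puts result.flash
```

The sketch only pre-fills the form and leaves running the search to the user, in line with the observation that the original criteria are unlikely to match unchanged.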

DeniseColbert commented 1 year ago

Decision: move forward with Richard's suggestion

DeniseColbert commented 1 year ago

Done, closing