FreeUKGen / MyopicVicar

MyopicVicar (short-sighted clergyman!) is an open-source genealogy record database and search engine. It powers the FreeREG database of parish registers, the FreeCEN database of census records, the next version of FreeBMD database of Civil Registration indexes and other Genealogical applications.
Implement pretty/persistent record URLs #1144

Closed PatReynolds closed 5 years ago

PatReynolds commented 7 years ago

An easier implementation than persistent URLs. Depends on #1470. Depends on #1435. Needs #1623.

Captainkirkdawson commented 7 years ago

How does this relate to #801?

PatReynolds commented 7 years ago

My understanding is that it provides a cheaper alternative. It's 1.7 so doesn't need consideration for some time.

Captainkirkdawson commented 7 years ago

One of my concerns is why there are 2 stories for the same thing, if that is what they are.

PatReynolds commented 6 years ago

@benwbrum is the underlying record url persistent?

benwbrum commented 6 years ago

We have an agreed-upon plan for URLs. The friendly_id gem is a good option for doing this. Ready to build.

benwbrum commented 6 years ago

See https://github.com/FreeUKGen/FreeUKRegProductIssues/issues/801 for plans. (cc @richardofsussex)

benwbrum commented 6 years ago

Further research has revealed that the friendly_id gem does not work with MongoDB. I'm working on a hand-built implementation instead that I think should work.

Captainkirkdawson commented 6 years ago

How does this relate to #1435, "search_id should not be required to view record"?

benwbrum commented 6 years ago

Thanks, Kirk -- I've moved #1435 to "in progress", since that's what I'm actually working on at the moment.

PatReynolds commented 6 years ago

I have added an 'urgent' tag to reflect need to make GDPR requests easier, and to increase visibility / income generation.

PatReynolds commented 6 years ago

needs #1435

benwbrum commented 6 years ago

I hope to have this finished this sprint.

PatReynolds commented 6 years ago

dependent on #1470

richpomfret commented 6 years ago

An example of a friendly URL for FreeREG: http://localhost:3000/search_records/58ea2b94a020dd02bffac8ec/john-whittle-burial-staffordshire-wolstanton-1637-03-10?search_id=5b17f0a8a020dd665730cbff&ucf=false and http://localhost:3000/search_records/58ea2b94a020dd02bffac8ec/john-whittle-burial-staffordshire-wolstanton-1637-03-10

richpomfret commented 6 years ago

120 year cut-off discussed in meeting - @benwbrum to analyse what we have to get an estimate.

Captainkirkdawson commented 6 years ago

In #1470 @benwbrum commented. "This only relates to changes in development to make URLs more friendly to search engines and those end users who download and save links to our records." @Captainkirkdawson asked. "Do not understand how this makes a search record more friendly to search engines." To which @benwbrum responded "I don't understand the mechanics, either, but did a great deal of research on the subject, and this appears to be helpful." Followed by my comment. "Since search engines have no access to search records then the only way would appear to be if the pretty url was stored in a database by the user and that was accessible. Seems a long shot to me. But I defer to others." To which @richardofsussex seemed to respond "How about using XML sitemaps (https://support.google.com/webmasters/answer/183668?hl=en) as our SEO mechanism, and leaving the URLs opaque?"

My question is: since our sitemap currently forbids search engines from accessing the search records, are we proposing that when friendly URLs are implemented we will be permitting the search engines to trawl through and index the search records? Given the proposed nature of the friendly URL, why will people use our site if they have the basic answer in Google?

benwbrum commented 6 years ago

This is absolutely part of a plan to allow search engines to crawl those entries which are available under open data. Currently such records are limited to Census records from Cornwall, as we do not yet have a mechanism for tracking which records have been contributed to other counties or systems by transcribers who have agreed to share the data online.

In my opinion, in an ideal world, a beginning genealogist would be able to type "Robert Thornton Yorkshire Baptism" into Bing or Google, and see listed (prominently, even!) one of our records, which they could click on and access our site. This would attract more attention (and possibly even volunteers) by researchers who are unaware of Free UK Genealogy and increase our advertising revenue.

edickens commented 6 years ago

We need to be very careful here. If people can take all our transcriptions and create their own database, and I expect Ancestry will add it to their collection, then I know that a lot of transcribers will delete their work. Also, if transcribers, who have put in a lot of work for free, hear that we are giving away their work, we will not have any transcribers. There is an "Open Data" marker against images, so should there also be one in a transcriber's profile to say whether they want all their work to be Open Data or not? If there is an "Open Data" marker = "No" against the images, then should the associated transcription also not be Open Data? Most data we are given to transcribe is under the understanding that it is for transcription and for searching. I appreciate that our new Agreement gives us rights over the use of the transcriber's transcriptions, but giving away their work is not one of them. There could be big trouble here.

benwbrum commented 6 years ago

There is a separate conversation about open data and bulk downloads of records which I don't want to re-hash here. This is not about that idea.

This feature (which is actually tracked in a different issue) is about allowing search engines like Google, Yahoo, and Bing to crawl a subset of our records for which our users have already given permission, in order to better lead researchers to our site. The scenario is, 1) user types something like "craness marriages cornwall" into Google, 2) Google shows links to some of our records in their search result pages, 3) user clicks and ends up on our website. This seems like an excellent way to drive traffic to our search engine and to provide service to researchers who don't know about our organization.

Captainkirkdawson commented 6 years ago

@benwbrum wrote "This is absolutely part of a plan to allow search engines to crawl those entries which are available under open data."

If that is the plan then I would suggest that care be taken so that the "friendly url" does not give all of the information. Why? Because I suspect many researchers who use Google (or other engines) in such a manner are not prone to digging any deeper than what is in the URL, e.g. john-whittle-burial-staffordshire-wolstanton-1637-03-10, and would not go further. (I know that is a cynical view of my fellow humans but after 79 years nothing surprises me any more.) A URL of john-whittle-burial-staffordshire-wolstanton would make them click the link!

edickens commented 6 years ago

Yes, that would be a great idea. The URL then could be for surname+county+place+church and then link to our search page for them to put in a proper search. This is not giving away any data. I would omit firstname as well, and definitely years and perhaps even record type. There is no Open Data problem and I am sure transcribers would approve. But Ben, try to work out how many URLs there will be. It will be a lot.

edickens commented 6 years ago

If the URL took them back to our Search criteria page, could the link part fill the boxes? They then need to add years, record type and firstname if known.

PatReynolds commented 6 years ago

I think Ben's proposal is fine. It doesn't give all the information available. The burial ones are probably most complete, as they have most of the data one can see on a search results page ... but even so, one would want to click through to see if there is anything additional such as age, status, etc. that might mean that you have the wrong person. For baptisms and weddings, much of the really interesting stuff (parents, partners, etc.) is not in the url.

SteveBiggs commented 6 years ago

But perhaps rather than including the full date in the URL, having just the year would almost guarantee they click through and there is still enough in the URL to catch the search.

benwbrum commented 6 years ago

This has been implemented in the same branch as the search engine work, and should be tested at the same time.

richpomfret commented 5 years ago

@benwbrum is this ready for testing?

benwbrum commented 5 years ago

It is deployed on test2, and I believe it should be ready for testing. The data itself is not quite ready to set up the browse-for-seo testing, I'm afraid...

smrr723 commented 5 years ago

Gave test2 a quick test as requested @PatReynolds, searching JOHN SMITH, 1730-1790, Buckinghamshire.

The URL of an individual entry, when clicked is - https://test2.freereg.org.uk/search_records/5817ac4ce93790eb7f5d0f0f/esther-weston-joh-smith-marriage-buckinghamshire-aston%20clinton-1730?search_id=5baca159e937908c3c82c38f&ucf=false

So that seems ok, I think? Not quite sure what the requirements here were.

I'm getting an error on that page, however. Looks to be some syntax errors in the search_records_controller, on lines 105, 108, 126 @benwbrum

/home/apache/hosts/freereg2/development/app/controllers/search_records_controller.rb:105: syntax error, unexpected <<, expecting keyword_end
<<<<<<< HEAD
^
/home/apache/hosts/freereg2/development/app/controllers/search_records_controller.rb:108: syntax error, unexpected ===, expecting keyword_end
=======
^
/home/apache/hosts/freereg2/development/app/controllers/search_records_controller.rb:126: syntax error, unexpected >>, expecting keyword_end
>>>>>>> master
^

richardofsussex commented 5 years ago

That deals with the 'prettify' aspect of the requirement; what about the 'persistent' aspect? Is the URL fragment /5817ac4ce93790eb7f5d0f0f, or the search_id 5baca159e937908c3c82c38f, meant to be the persistent aspect? The error message is caused by Git merge-conflict markers being left in the source code; some form of Git resolution is required, I think.
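For illustration only, here is a minimal Ruby sketch of how leftover merge-conflict markers like those in the error above could be spotted before a deploy. The helper name `conflict_marker_lines` is hypothetical and not part of MyopicVicar; it simply scans source text for lines that begin with exactly seven `<`, `=` or `>` characters.

```ruby
# Hypothetical helper (not part of MyopicVicar): list the lines of a source
# string that look like unresolved Git merge-conflict markers.
# A marker is exactly seven '<', '=' or '>' characters at the start of a line,
# followed by whitespace or the end of the string.
CONFLICT_MARKER = /\A(<{7}|={7}|>{7})(\s|\z)/

def conflict_marker_lines(source)
  source.each_line.with_index(1)
        .select { |line, _number| line.match?(CONFLICT_MARKER) }
        .map { |line, number| [number, line.chomp] }
end

# Example: a controller fragment left in a half-merged state.
sample = <<~RUBY
  def show
  <<<<<<< HEAD
    render :show
  =======
    render :friendly
  >>>>>>> master
  end
RUBY

conflict_marker_lines(sample).each do |number, text|
  puts "line #{number}: #{text}"
end
```

A check along these lines could be run over `app/controllers/search_records_controller.rb` by passing it `File.read(path)`; an empty result would mean no markers were found.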

benwbrum commented 5 years ago

I agree with Richard on the bad merge of the git source. Let's move this one back to 'in progress' and I will fix that.

Half of persistence is accomplished by making the URL no longer dependent on the search_id parameter -- we should be able to delete the ? and all the query parameters after it and have the URL still work. I need to review our discussions of this issue several months ago to see if there were other concerns.

benwbrum commented 5 years ago

This is deployed on test2 and ready for exploration.

Example of URL which should be permanent (i.e. not dependent on search query state) and friendly: https://test2.freereg.org.uk/search_records/5817c0c0e93790eca3cfa9a8/john-whittle-baptism-worcestershire-redmarley%20d'abitot-1714

edickens commented 5 years ago

Will this be unique? Places have more than one Church. And someone may appear in both a Parish Register and a Bishops Transcript. Perhaps the long number will sort it out. I've only looked at the text part. But it works.

benwbrum commented 5 years ago

The number does make the URL unique, so it is a reference to this specific record on our system.

PatReynolds commented 5 years ago

Search query gets : https://test2.freereg.org.uk/search_records/5818442de93790eb7f7ad5db/ann-bainbridge-william-rowntree-marriage-yorkshire,%20north%20riding-east%20hauxwell-1825?search_id=5bb5cf89e937900aa3f62968&ucf=false

How do I (as a researcher) get to the more readable https://test2.freereg.org.uk/search_records/5818442de93790eb7f7ad5db/ann-bainbridge-william-rowntree-marriage-yorkshire,%20north%20riding-east%20hauxwell-1825 ?

Do spaces have to be rendered %20? The one before the county seems redundant. https://test2.freereg.org.uk/search_records/5818442de93790eb7f7ad5db/ann-bainbridge-william-rowntree-marriage-yorkshire,north_riding-east_hauxwell-1825 seems tidier, to me.

PatReynolds commented 5 years ago

Hi @edickens, the unique bit is before the last stroke, so for Ann and William's marriage, the unique bit (indeed, the only bit you need to copy and paste) is https://test2.freereg.org.uk/search_records/5818442de93790eb7f7ad5db/ . Anything that comes after that is not used at all in finding the record; it's just there so that when you see the URL, you can remember which person/event it is that you are looking at (useful for people like me who have difficulty telling https://test2.freereg.org.uk/search_records/5818442de93790eb7f7ad5db/ from https://test2.freereg.org.uk/search_records/5817c0c0e93790eca3cfa9a8 without such a prompt). You can, almost, put anything you like after the first bit and another stroke: https://test2.freereg.org.uk/search_records/5817c0c0e93790eca3cfa9a8/tom_bombadill_and_goldberry for example.
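The behaviour described here (only the identifier segment matters; the readable text and any query string are ignored) can be sketched in plain Ruby. This is an assumption-laden illustration, not the project's routing code: the helper name `record_id_from_url` and the pattern `RECORD_PATH` are hypothetical, and the 24-hex-digit length is simply what the example URLs in this thread show.

```ruby
require "uri"

# Hypothetical pattern: /search_records/<24 hex digits>, optionally followed
# by one readable segment that is ignored when finding the record.
RECORD_PATH = %r{\A/search_records/([0-9a-f]{24})(?:/[^/]*)?/?\z}

# Hypothetical helper: return the record identifier from a record URL,
# or nil when the URL is not a record URL. The query string (search_id, ucf)
# plays no part, because only the path is inspected.
def record_id_from_url(url)
  match = RECORD_PATH.match(URI.parse(url).path)
  match && match[1]
end

urls = [
  "https://test2.freereg.org.uk/search_records/5817c0c0e93790eca3cfa9a8",
  "https://test2.freereg.org.uk/search_records/5817c0c0e93790eca3cfa9a8/tom_bombadill_and_goldberry",
  "https://test2.freereg.org.uk/search_records/5817ac4ce93790eb7f5d0f0f/esther-weston-joh-smith-marriage-buckinghamshire-aston%20clinton-1730?search_id=5baca159e937908c3c82c38f&ucf=false"
]

urls.each { |url| puts record_id_from_url(url) }
```

The first two URLs yield the same identifier, which is the sense in which the text after the last stroke is only a reminder for the reader.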

edickens commented 5 years ago

What message is given to a researcher if one of these URLs fails? This can happen if the entry is deleted by the transcriber because say the batch has been duplicated. I cannot test this.

PatReynolds commented 5 years ago

Good question, Eric. I just tried slightly altering the permanent url (reversed last two characters) and got this:

https://test2.freereg.org.uk/search_records/5818442de93790eb7f7ad5bd

I think this is acceptable (i.e. a pretty obvious 'something went wrong!!!!' message), but not user-friendly. If there is no Bad Stuff (tm) that can happen as a result, we should look, longer term, at seeing how we might avoid this happening.

benwbrum commented 5 years ago

I very much like the idea of replacing spaces with hyphens. None of my testing apparently used multi-word names!

SteveBiggs commented 5 years ago

I think the researcher should be returned a much clearer error page than that, stating something straightforward like "Invalid Search URL, please check" or similar.

edickens commented 5 years ago

But they would not have created the URL, so it would have been valid when they saved it. The message should be more like "This entry has been changed so that this URL is no longer valid, please redo your search and save the new URL." And what happens if a name is amended? My guess is that the URL will work, but bring up the revised name. Probably OK, but the researcher may be confused.

richardofsussex commented 5 years ago

In reply to Ben's "I very much like the idea of replacing spaces with hyphens. None of my testing apparently used multi-word names!": Can't you use '+', which actually means space?

smrr723 commented 5 years ago

From an SEO (and readability) perspective, hyphens are recommended over anything else for word separation.

'+' seems to be recommended for dynamic urls only and not static.

https://webmasters.stackexchange.com/questions/374/urls-should-i-use-hyphens-underscores-or-plus-symbols https://support.google.com/webmasters/answer/76329?hl=en

richardofsussex commented 5 years ago

OK, fine. So long as we avoid having %20 in the URLs, I don't mind how it's achieved.

benwbrum commented 5 years ago

I think I've just fixed the problem with spaces. I'd recommend testing on records that are in places or churches that contain spaces or apostrophes in their names.

This is ready to test on test2.
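As a rough illustration of the hyphen approach discussed above, the following Ruby sketch builds the readable part of a URL from record fields. It is not the code deployed on test2: the helper name `friendly_slug` and the choice to collapse every run of characters outside a-z and 0-9 (spaces, commas, apostrophes) into a single hyphen are assumptions made for the example, and non-ASCII letters would also be replaced under this rule.

```ruby
# Hypothetical slug builder (not the project's implementation).
# Each part is lower-cased, runs of characters outside a-z and 0-9 become a
# single hyphen, leading and trailing hyphens are trimmed, and the parts are
# joined with hyphens. nil and empty parts are skipped.
def friendly_slug(*parts)
  parts.compact
       .map { |part| part.to_s.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "") }
       .reject(&:empty?)
       .join("-")
end

puts friendly_slug("John", "Whittle", "baptism", "Worcestershire", "Redmarley D'Abitot", 1714)
puts friendly_slug("Ann", "Bainbridge", "William", "Rowntree", "marriage",
                   "Yorkshire, North Riding", "East Hauxwell", 1825)
```

Because the identifier segment, not the readable text, locates the record, a slug built this way does not need to be unique or reversible.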

richardofsussex commented 5 years ago

Urls look fine now (though I haven't turned up any apostrophes). Is the identifier part of it now persistent across database builds?

Captainkirkdawson commented 5 years ago

@richardofsussex FR is not rebuilt. In fact it has never been rebuilt after the initial creation some 3 years ago. It is continuously updated with new and corrected records from uploaded files and online editing. A corrected or edited record may or MAY NOT retain the original record identifier depending upon the nature of the fields that are changed.

richardofsussex commented 5 years ago

Thanks - I was thinking that FR was like BMD (clearly not). So, in order to achieve persistent URLs, we need to revise the updating strategy so that record identifiers never change. What are the challenges in doing this?

Captainkirkdawson commented 5 years ago

@richardofsussex FR is based on totally different concepts and technology from BMD. It was designed to support both file and online data entry systems such as script. The latter was never completed for a variety of reasons. However, the underlying database supports online transcription should it ever be developed. As noted, it is currently possible to do online entry and correction, and our members do online editing on a regular basis. FR is implemented using MongoDB and a search record is sharded across multiple servers using a shard key that is immutable. The shard key we use involves the date of an event. As a result, any change in the event date in a search record, be it from a file update or an online edit, means the creation of a new search record and the deletion of the old one. Hence there will be situations in which a search record id does change or indeed is deleted without replacement. Hence absolute persistence of the search record id is impossible without implementing a different shard key (a really massive undertaking). We do trap any attempt to access a non-existent search record id, and it would be straightforward to add text suggesting that the search for the specific record be repeated. If you wish to explore more fully the nature and structure of FR we could take that into a Slack discussion. My handle is the same.
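The trap-and-explain behaviour described above might look something like the following Ruby sketch. Everything here is a stand-in: a plain Hash takes the place of the MongoDB collection, `look_up_record` and `RecordLookup` are hypothetical names, and the message wording is adapted from the suggestion made earlier in this thread rather than taken from the application.

```ruby
# Hypothetical result object: either a record or a message for the researcher.
RecordLookup = Struct.new(:record, :message)

# Wording adapted from the suggestion earlier in this thread (an assumption,
# not the text the application actually shows).
MISSING_RECORD_MESSAGE =
  "This entry has been changed or removed, so this URL is no longer valid. " \
  "Please repeat your search and save the new URL."

# Look the id up in the store (a Hash standing in for the real collection).
# A missing id is trapped and turned into a readable message instead of an error.
def look_up_record(store, record_id)
  record = store[record_id]
  return RecordLookup.new(record, nil) if record

  RecordLookup.new(nil, MISSING_RECORD_MESSAGE)
end

store = { "5818442de93790eb7f7ad5db" => { record_type: "marriage", year: 1825 } }

# The second id has its last two characters swapped, as in the test above.
["5818442de93790eb7f7ad5db", "5818442de93790eb7f7ad5bd"].each do |id|
  result = look_up_record(store, id)
  puts(result.record ? "found #{id}" : result.message)
end
```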

richardofsussex commented 5 years ago

Can't the original shard key be stored in an additional field, and this field 'carried forward' when the record is updated?
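One way to picture this 'carry forward' idea is the in-memory Ruby sketch below. It is purely illustrative and makes several assumptions: `RecordStore` is a hypothetical class, Hashes stand in for the sharded MongoDB collection, and the extra field is called `original_id` only for the example. It says nothing about how costly such a field would be to maintain in FreeREG itself.

```ruby
# Hypothetical in-memory stand-in for the search-record collection.
class RecordStore
  def initialize
    @records = {}         # current id => record attributes
    @current_id_for = {}  # every id ever issued => current id
  end

  # Store a new record; its first id is remembered as original_id.
  def add(id, attributes)
    @records[id] = attributes.merge(original_id: id)
    @current_id_for[id] = id
  end

  # Replace a record under a new id (e.g. after an event-date change),
  # carrying the original id forward and re-pointing all older ids.
  def replace(old_id, new_id, attributes)
    old = @records.delete(old_id)
    raise KeyError, "unknown record #{old_id}" unless old

    @records[new_id] = attributes.merge(original_id: old[:original_id])
    @current_id_for.transform_values! { |current| current == old_id ? new_id : current }
    @current_id_for[new_id] = new_id
  end

  # Find by any id the record has ever had; nil when the id was never issued.
  def find(id)
    @records[@current_id_for[id]]
  end
end

store = RecordStore.new
store.add("a1", { forename: "JOHN", year: 1714 })
store.replace("a1", "b2", { forename: "JOHN", year: 1715 })
p store.find("a1")
```

In this sketch a saved URL containing the old id still resolves to the revised record; records deleted without replacement would still need the kind of message discussed above.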

edickens commented 5 years ago

I think we need to get this into perspective. The number of times a record changes is small, and the odds that it will be one where someone has saved the URL are longer than winning the lottery. So Kirk's suggestion to just request the search to be done again is the best. Remember that names get edited more often than anything else, so the original search is no longer correct.