FreeUKGen / FreeCENMigration

Issue tracking for project migrating FreeCEN to FreeCEN2 genealogy record database and search engine architecture. Code developed here is based on that developed in MyopicVicar
https://www.freecen.org.uk
Apache License 2.0

Implementation of UTF-8 #1655

Open geoffj-FUG opened 9 months ago

geoffj-FUG commented 9 months ago

Anne reported that Irish characters are not displaying correctly. See the attached screenshot, for example Kilmoon, and see Wikipedia for the correct display of the Irish name. I think the problem is that the page encoding is UTF-8 but the name is stored in a different encoding. This may be a general problem.

In September 2019 Kirk said “Have added code to ensure conversion to UTF-8. Tested on my development setup” (Story #725). When Kirk became ill there were many balls in the air. Some have been collected and resolved. Others bounced away and hid under the furniture. I know that UTF-8 was one of these balls. The significance of this initiative being missed was not obvious to me. Now I understand why this was just one of the many initiatives being worked on concurrently. From my knowledge it was certainly one that was to be achieved.

It appears that this was not fully implemented. We did however introduce a range of characters to the Gazetteer and Kirk got me to amend the definition of acceptable characters for Broad Text in the Handbook. There have been problems with these characters since.

We need to review how far the introduction of UTF-8 got, and what is needed to complete this change.

Geoff

AnneV-Learn commented 9 months ago

@geoffj-FUG I've done some investigation. Verbatim birth place and Alternative birth place are currently validated using the BROAD_VALID_TEXT pattern match. Currently, if an accented character is entered in CSVProc data, it will be rejected with an 'Invalid Text' ERROR. So it looks like we do need a wider pattern match that includes accented characters for birth places. Accented characters CAN be entered in Gazetteer place names.

geoffj-FUG commented 9 months ago

In that case we need to widen the pattern match so that accented characters can be entered in fields defined as broad text. If we do it now it will prevent problems in the future. Channel Islands 1911 definitely needs them, as does the Isle of Man. People born overseas in 1911 will also use them. 1911 is already being transcribed in a few counties.

Geoff

AnneV-Learn commented 9 months ago

@geoffj-FUG Looking at the code BROAD text is currently defined as:

Any letter in the basic Latin alphabet, number, underscore, space, hyphen, open round bracket, close round bracket, dot, comma, ampersand, single quote, double quote, colon, semi-colon.

CSVProc data validation uses it to validate the following fields:

house or street name (i.e. address), surname, forenames, occupation, industry, birth place, father place of birth, disability, disability notes

An 'extended' form of BROAD text, which also allows question mark and forward slash, is used to validate the notes field.
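
The BROAD text definition above can be read as a character class. The following is a minimal sketch in Python, for illustration only; it is not the project's own code. The names `BROAD_TEXT`, `BROAD_TEXT_PLUS`, `is_broad_text` and `is_broad_text_plus` are hypothetical, and it assumes that "letter in the basic Latin alphabet" means A-Z and a-z and that "number" means the digits 0-9.

```python
import re

# Hypothetical patterns reconstructed from the prose description above;
# the project's real BROAD_VALID_TEXT definition may differ in detail.
BROAD_CHARS = r"""A-Za-z0-9_ \-().,&'":;"""
# Empty values are treated as valid here; whether they should be is
# outside the scope of this sketch.
BROAD_TEXT = re.compile(rf"[{BROAD_CHARS}]*")
# The 'extended' form additionally allows question mark and forward slash.
BROAD_TEXT_PLUS = re.compile(rf"[{BROAD_CHARS}?/]*")


def is_broad_text(value: str) -> bool:
    """Return True if every character of value is allowed by BROAD text."""
    return BROAD_TEXT.fullmatch(value) is not None


def is_broad_text_plus(value: str) -> bool:
    """Return True if value is allowed by the extended (notes) form."""
    return BROAD_TEXT_PLUS.fullmatch(value) is not None
```

Under these assumptions a plain name such as `Kilmoon` passes, while any value containing an accented letter fails, which is consistent with the 'Invalid Text' error described earlier in the thread.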

I am thinking that we perhaps need to have a new text category which would allow accented characters, rather than just the basic Latin alphabet, because accented characters may not be appropriate for all the fields where BROAD text is currently applied.
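
One possible shape for such a new category is sketched below in Python, purely as an illustration. The name `WIDE_TEXT` and the function `is_wide_text` are hypothetical, and the choice of Unicode ranges (the letters of the Latin-1 Supplement and Latin Extended-A blocks) is an assumption about which accented characters would be wanted, not an agreed policy.

```python
import re
import unicodedata

# Hypothetical wider category: the BROAD text characters described above,
# plus accented Latin letters from the Latin-1 Supplement and Latin
# Extended-A blocks. U+00D7 and U+00F7 are the multiplication and division
# signs, so the ranges are split to exclude them.
ACCENTED_LETTERS = r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u017F"
WIDE_TEXT = re.compile(
    rf"""[A-Za-z{ACCENTED_LETTERS}0-9_ \-().,&'":;]*"""
)


def is_wide_text(value: str) -> bool:
    """Accept BROAD text characters plus precomposed accented Latin letters."""
    # Normalise to NFC first so that a letter typed as base + combining
    # accent is checked in its single-code-point (precomposed) form.
    return WIDE_TEXT.fullmatch(unicodedata.normalize("NFC", value)) is not None
```

Keeping this as a separate category, rather than widening BROAD text itself, would let each field opt in individually, which matches the concern that accented characters may not be appropriate everywhere.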

There appears to be NO validation applied to the Place Name or Alternative Place Names when creating a new place or alternative place.

AnneV-Learn commented 9 months ago

@geoffj-FUG We may have to consider what is allowed in the Search Field when Searching the Gazetteer too. Currently for Advanced Searches the Search Field must contain only standard Latin letters (minimum of three).

geoffj-FUG commented 7 months ago

Just out of interest I copied and pasted Sainte-Marie-du-Câtel GSY as an alternative name to St Mary de Castro when the second was mis-spelt in a POB place name. The system matched the Sainte-Marie-du-Câtel to the same entry in the Gazetteer. So the system is handling the comparison between the fields despite only the Gazetteer being UTF-8. The issue is whether we convert the broad text fields to UTF-8 to allow the transcription of place names in the future.

Geoff

Vino-S commented 7 months ago

@AnneV-Learn to investigate

AnneV-Learn commented 7 months ago

@geoffj-FUG
My Observations: 
1. As I noted above, there appears to be NO validation applied to the character set for a Place Name or Alternative Place Name when creating a new place or alternative place name in the Gazetteer.
2. Gazetteer Advanced Search does not allow 'accented' characters, so if you search for Contains Câtel it is rejected with the 'Advanced search must contain alphabetic characters only' error message. And if you search for Contains Catel, Sainte-Marie-du-Câtel is not found. However, a 'normal text search' in the Gazetteer will match/find 'accented' characters.
3. 'Accented' characters cannot be loaded as CSVProc data for 'birth place' or 'father place of birth' as these are validated against the definition of BROAD text. An error such as 'Verbatim Birth Place Sainte Marie Du Câtel Is Invalid Text.' will be raised.
4. See above for the other fields that are currently validated against the BROAD text definition; a CSVProc upload will raise an error if 'accented' characters are present in these fields.
5. VLD POB validation does not restrict updated 'Alternative Place Names' to BROAD text, so 'accented' characters can be entered and will match the Gazetteer as you found.
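
The gap in observation 2 (a search for Catel not finding Câtel) is the kind of thing accent folding addresses. The sketch below, in Python for illustration, shows one common technique: decompose each string with Unicode NFD normalisation, drop the combining marks, and compare the results. The function names are hypothetical and this is not how the Gazetteer search is currently implemented.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text case-folded with combining accents removed."""
    # NFD splits a precomposed letter such as â into 'a' + combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def contains_folded(place_name: str, search_term: str) -> bool:
    """Hypothetical 'Contains' search that ignores accents and case."""
    return fold_diacritics(search_term) in fold_diacritics(place_name)
```

With this folding, `contains_folded("Sainte-Marie-du-Câtel", "Catel")` is true. One limitation to be aware of: letters with no canonical decomposition, such as ø, are left unchanged by this approach.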


DeniseColbert commented 7 months ago

To be addressed after POB search engine is developed.

geoffj-FUG commented 6 months ago

I had an opportunity to experiment.

I entered all the options for Finland Tornio into the Gazetteer (lots of accents etc). I then copied and pasted Finland Torneå (the Swedish spelling) into the alternative place of birth. Although the Swedish spelling was in both data sets the system would not match them. I then pasted the Lapland spelling of Lapland Tuárnus into the alternative POB and it was accepted and matched.

I looked and the Gazetteer had not saved the Swedish spelling. I copied and pasted it again and it saved.

The å seems to have been the problem. I do not have another option to test it at the moment as it is a live file.

Geoff

geoffj-FUG commented 5 months ago

Cheryl, one of my validators, says: “I am still having a problem with punctuation marks creating an "invalid text" error message. I had one French surname in this piece, D'Aubrey, and the system sent an error message for this. I deleted the apostrophe and the name was accepted.”

Is this a UTF-8 issue?

Geoff

AnneV-Learn commented 5 months ago

@geoffj-FUG Surname is validated against BROAD_VALID_TEXT definition which accepts a true apostrophe ' but not a right single quote which is what I suspect Cheryl must have had in her text. They are subtly different in appearance and the underlying UTF-8 code is different. Some word processors try to be too clever and convert a simple apostrophe to a right single quote (more slanted/curly). I think we need to stick with the straight apostrophe being permitted but not the right single quote, as a straight apostrophe is the standard character from most keyboards and is what researchers would readily use in a search.
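
The difference between the two characters can be made visible by printing their code points and UTF-8 bytes. This is a small Python sketch for illustration; the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point, Unicode name and UTF-8 bytes of one character."""
    utf8_hex = ch.encode("utf-8").hex(" ")
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} utf-8={utf8_hex}"


print(describe("'"))       # U+0027 APOSTROPHE utf-8=27
print(describe("\u2019"))  # U+2019 RIGHT SINGLE QUOTATION MARK utf-8=e2 80 99
```

The straight apostrophe is a single byte in UTF-8, while the curly right single quotation mark is three bytes, so a pattern that lists only the former will reject the latter.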

geoffj-FUG commented 5 months ago

Anne

I agree

Geoff


AlOneill commented 5 months ago

@AnneV-Learn @geoffj-FUG I think we need to be careful here. The keyboard apostrophe and single right quote are the same thing — there is only one key — even if the spreadsheet software is substituting curly (typographers or smart) quotation marks.

Where is the "true apostrophe" key as opposed to the single quotation mark key? (Perhaps my Mac keyboard does not have it …)

FreeREG had the same issue: it was resolved by telling spreadsheet users to turn off all the 'smart' substitutions. Do you think that this is the issue here?

AnneV-Learn commented 5 months ago

@AlOneill @geoffj-FUG On a Mac the true apostrophe is the default one on the keyboard, to get the right single quotation mark you can use the ‘option’ key and the ] key. They are not the same thing as the underlying UTF-8 ‘value’ is different. Turning off smart substitutions may be a good idea but might not suit all users. If during data validation users encounter the invalid text error they should investigate and make sure that it is a pure apostrophe and not a right single quote.
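
To help with that kind of investigation, a small check can list any characters in a value that fall outside plain ASCII, together with their Unicode names. This is a Python sketch for illustration only; `find_non_ascii` is a hypothetical helper, not part of the existing validation.

```python
import unicodedata
from typing import List, Tuple


def find_non_ascii(value: str) -> List[Tuple[int, str, str]]:
    """List (position, character, Unicode name) for each non-ASCII character."""
    return [
        (index, ch, unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(value)
        if ord(ch) > 127
    ]
```

For a surname typed with a straight apostrophe the list is empty. If a word processor has substituted a curly quote, the entry shows its position and a name such as RIGHT SINGLE QUOTATION MARK, which makes the substitution easy to spot even when the two characters look alike on screen.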

geoffj-FUG commented 5 months ago

Thanks Alison and Anne

I raised this only because we have the UTF-8 issue as a story. When I find something that might be relevant I bring it up.

It appears that this was a really old vld file that was being converted to csv and re-proofread and revalidated because the original transcription had so many issues with it.

I will file the facts away in the back of my mind and leave it at that. The transcriber has not been with FreeCEN for years.

Geoff


AlOneill commented 5 months ago

@AnneV-Learn @geoffj-FUG
Interesting: option - ] gives me – a pair of single curly quotation marks; option - [ a pair of doubles, in my text editor (TextMate). If I do the same in this app, I'm getting ‘ and “. We would still have to tell transcribers to disable 'smart' substitutions else their apostrophes would become smart quotation marks in many apps (I have mine turned off). We need to warn volunteers and explain the situation to them.

So, what do we get on a Mac with shift and apostrophe? (I'm on a laptop and have never owned a desktop Mac.)

What is the situation on a Windows PC?

AnneV-Learn commented 5 months ago

@AlOneill I get a pure standard double quote " on my iMac with shift and apostrophe.

geoffj-FUG commented 1 month ago

@AnneV-Learn We have a hiatus of about a month before I am available to work on our next initiative which should be to develop the next step in POB searching.

I believe that if we tidy up the UTF-8 issue then we will have removed a possible challenge during the POB searching development. We have an opportunity to tackle this if the task is not too big. Apparently 98% of websites use UTF-8. This would bring us in line with industry standards as well as solving the UTF-8 queries that arise.

What is involved in making the changes to implement UTF-8 for Broad Text fields, please?

Geoff

AnneV-Learn commented 1 month ago

@Geoffj-FUG

As noted previously, CSVProc data validation uses the BROAD_TEXT 'mask' to validate the following fields:

house or street name (i.e. address), surname, forenames, occupation, industry, birth place, father place of birth, disability, disability notes

NB An 'extended' form of the BROAD text mask (named BROAD_TEXT_PLUS), which also allows question mark and forward slash, is used to validate the notes field.

If we are talking about creating a new validation 'mask' (maybe named PLACE_NAME_TEXT) which allows diacritics, then at first glance that should be OK as long as a standard place name is not entered into the Gazetteer twice (i.e. once with accents and once without!). There may be a case for only allowing diacritics in the Alternate Place Names in the Gazetteer. The Gazetteer search uses a special sort of MongoDB search called a $text search, which accommodates finding hits with or without diacritics, i.e. é will still find a plain e. The actual validation code for Place of Birth would have to be checked to see that it will work with diacritics when determining if a POB is valid.
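
The risk of the same place being entered twice, once with accents and once without, could be checked by grouping names on an accent-folded key. The following Python sketch illustrates the idea; the function names are hypothetical and it assumes the place names are available as a simple list of strings.

```python
import unicodedata
from collections import defaultdict
from typing import Dict, Iterable, List


def fold(name: str) -> str:
    """Case-fold a name and remove combining accents."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


def accent_only_duplicates(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that differ only by accents or letter case."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[fold(name)].append(name)
    # Keep only the keys that have more than one distinct spelling.
    return {key: variants for key, variants in groups.items() if len(set(variants)) > 1}
```

A check like this could be run over existing entries, or before saving a new one, to flag pairs that would otherwise look like two different places.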

However, if we are talking about allowing diacritics in First Name and/or Surname then I think it is a much bigger challenge, as I don't believe the General SEARCH for FreeCEN will find matches unless they are exact matches (i.e. é will not find a plain e and vice versa). This needs further investigation.

Also, what about FreeBMD and FreeREG place names?

Vino-S commented 1 month ago

@DeniseColbert we need a policy on this before developing the Search

DeniseColbert commented 1 week ago

FreeBMD doesn't include non-standard characters in place names. Vino will check FreeREG's use and report back.

It's agreed that UTF-8 is what we want to achieve and now is the time to do it, even if it is a lot of work: it will prevent bigger problems in future.

Suggested to test the system as it currently is (i.e. where Kirk's work left us) using non-standard characters in Test 3 to identify and fix any bugs, and move forward from there, if appropriate.

AnneV-Learn commented 1 day ago

@Vino-S FYI, I came across some information about the use of collation indexing in MongoDB. It may be of use in solving the UTF-8 issue for searching, but I am not sure how it works with Rails. https://www.mongodb.com/docs/current/reference/collation/
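
As a rough illustration of what the linked page describes: a collation document with `strength` set to 1 asks MongoDB to compare base characters only, so differences of accent and case are ignored in string comparisons. The Python sketch below only builds the command document and does not connect to a database. The collection name `places` and field name `place_name` are hypothetical placeholders, and how this would be passed through Rails is not covered here.

```python
from typing import Any, Dict

# Collation document: locale "en" with strength 1 compares base letters
# only, so differences of accent and case are ignored.
ACCENT_INSENSITIVE: Dict[str, Any] = {"locale": "en", "strength": 1}


def find_place_command(place_name: str) -> Dict[str, Any]:
    """Build a MongoDB `find` command that applies the collation above.

    The collection name "places" and the field name "place_name" are
    hypothetical placeholders, not the project's real schema.
    """
    return {
        "find": "places",
        "filter": {"place_name": place_name},
        "collation": ACCENT_INSENSITIVE,
    }
```

Collation affects equality matches and sorting. It is worth confirming in the documentation whether it would help with 'Contains' style searches, since regular-expression matching is understood not to be collation-aware.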