Open geoffj-FUG opened 1 year ago
@geoffj-FUG. Yes that is correct. I have changed the text type and it is available for testing on Test3. (I don't seem to be able to change the pipeline for this or add a label - no idea why?).
@AnneV-Learn I have entered several alternative places of birth into the spreadsheet and got the following results:
Started on the file RG12_1900_test.csv for somt.cen at 2022-12-21 07:25:15 +0000.
Working on Ilchester for 1891, in SOM.
ERROR: line 2 Alt. Birth Place ����� is invalid text.
ERROR: line 3 Alt. Birth Place ����� is invalid text.
ERROR: line 4 Alt. Birth Place ����� is invalid text.
ERROR: line 5 Alt. Birth Place � is invalid text.
ERROR: line 6 Alt. Birth Place ����ܟ? is invalid text.
ERROR: line 7 Alt. Birth Place ��� is invalid text.
ERROR: line 8 Alt. Birth Place �����? is invalid text.
Warning: line 27 House address ����ܟ? has trailing ?. Removed and address_flag set.
Warning: line 27 Address Flag is x.
Warning: line 34 House address �����?
When I found that the characters were not accepted in the birth_place field, I copied them into the address field. They were accepted there in several instances (but not all).
I entered the test characters using the table at https://www.extendoffice.com/documents/excel/4903-excel-add-accent-mark.html
I will send you the test spreadsheet by email.
Geoff
Anne
Test spreadsheet attached
Geoff
(Sent by email in reply to Anne Vandervord's comment of Monday, 19 December 2022 on FreeUKGen/FreeCENMigration issue #1469, "Change definition of birth_place field".)
@geoffj-FUG - I've done some further investigation.
Before I made any code changes the verbatim_birth_place field was defined as Broad Text and birth_place was defined as Narrow Text. I changed birth_place to Broad Text.
Broad Text appears to accept A-Z, a-z plus the following characters: .,&'":; whereas Narrow Text accepts A-Z, a-z plus ' and .
House or street (address) allows any text but will convert a trailing ? to an address flag.
FYI - the Gazetteer allows any character in the place name.
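As a rough illustration of the address rule described above, the following Python sketch strips a trailing ? and sets a flag. The function name `normalise_address` is hypothetical and this is not the CSVProc code; the flag value `x` is taken from the processor's "Address Flag is x" warning.

```python
def normalise_address(address):
    """Return (cleaned_address, address_flag) for a house/street entry.

    Hypothetical helper: any text is allowed, but a trailing '?' is
    removed and reported through the address flag instead.
    """
    stripped = address.rstrip()
    if stripped.endswith("?"):
        # Drop the '?' and any space left in front of it, then set the flag.
        return stripped[:-1].rstrip(), "x"
    return stripped, ""


print(normalise_address("High Street?"))
print(normalise_address("High Street"))
```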
@geoffj-FUG
Apologies, my understanding of the definition of BROAD_TEXT in the FreeCEN CSVProc validations (in my previous comment) was not quite right (the regular expressions used are quite difficult to interpret).
BROAD accepts A-Z, a-z, 0-9, a single space plus the following characters -_.’,()&":;
NARROW accepts A-Z, a-z, 0-9, a single space plus the following characters -_.’,
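To make the two character sets above easier to check against sample entries, here is a minimal Python sketch. It is not the regular expression used in CSVProc; the names `NARROW`, `BROAD` and `is_valid` are illustrative, the apostrophe is assumed to be the plain ASCII ', and "a single space" is assumed to mean that consecutive spaces are rejected.

```python
import re

# Punctuation allowed in addition to A-Z, a-z and 0-9, as listed above.
NARROW_EXTRA = "-_.',"
BROAD_EXTRA = NARROW_EXTRA + "()&\":;"


def build_pattern(extra):
    """Build a pattern for runs of allowed characters separated by single spaces."""
    allowed = "A-Za-z0-9" + re.escape(extra)
    return re.compile(rf"[{allowed}]+(?: [{allowed}]+)*")


NARROW = build_pattern(NARROW_EXTRA)
BROAD = build_pattern(BROAD_EXTRA)


def is_valid(text, pattern):
    """True if the whole entry consists only of permitted characters."""
    return pattern.fullmatch(text) is not None


print(is_valid("Stoke (near Ilchester)", NARROW))
print(is_valid("Stoke (near Ilchester)", BROAD))
print(is_valid("Orléans", BROAD))
```

Under these assumptions the parentheses are rejected by NARROW but accepted by BROAD, and an accented letter is rejected by both.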
The production environment is currently set up as follows:
Surname - BROAD
Forenames - BROAD
Verbatim Place of Birth - BROAD
Place of Birth - NARROW
Where father born - BROAD
Test3 is the same apart from Place of Birth which I changed to BROAD.
Looking in the current version of the Handbook I see that it states Place of Birth is defined as BROAD (not currently the case in production code as indicated above).
But in the Handbook BROAD and NARROW are defined as follows:
Narrow Text – Narrow Text permits any word character (letter, number, underscore) plus , '
(i.e. does not mention that . or - are permitted)
Broad Text - Broad Text permits any word character (letter, number, underscore) plus - ( ) .,&'
(i.e. does not mention that : ; " are permitted)
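The differences between the Handbook wording and the code can be cross-checked with a small set comparison. This is only a sketch: the character sets are typed in from this comment rather than read from the code or the Handbook, letters and digits are left out because both sources agree on them, and the names are illustrative.

```python
# Extra characters accepted by the code, as listed earlier in this comment.
CODE_EXTRAS = {
    "NARROW": set("-_.',"),
    "BROAD": set("-_.',()&\":;"),
}

# Extra characters mentioned in the Handbook (underscore counts as a word character).
HANDBOOK_EXTRAS = {
    "NARROW": set("_,'"),
    "BROAD": set("_-().,&'"),
}


def undocumented(text_type):
    """Characters the code accepts that the Handbook does not mention."""
    return CODE_EXTRAS[text_type] - HANDBOOK_EXTRAS[text_type]


for text_type in ("NARROW", "BROAD"):
    print(text_type, sorted(undocumented(text_type)))
```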
The Handbook also states:
The ‘As Is’ Rule ……. Do Not Enter Any Special Characters - Characters such as fractions, accented letters or similar are outside of the character sets for text fields. They will be reported as Errors when the spreadsheet is tested. The characters that can be used in text fields are outlined in the Column (field) Types section above. Fractions are permitted in some numeric fields.
Note 1: If the surname has been enumerated with an accent (e.g. e acute or e grave). Enter the non-accented letter. Letters with accents are not within the character set.
Note: If the forename has been enumerated with an accent (e.g. e acute or e grave), you will need to enter the non-accented letter. Letters with accents are not within the character set.
If the Place of birth has been enumerated with an accent (e.g. e acute or e grave), enter the non-accented letter. Letters with accents are not within the character set.
Note: In some cases, the father’s Place of birth has been enumerated with an accent (e.g. e acute or e grave). In this case enter the non-accented letter. Letters with accents are not within the character set.
So accented letters are not allowed/accepted, which is true for the fields defined as BROAD or NARROW.
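For anyone preparing a spreadsheet, the Handbook's "enter the non-accented letter" rule could be applied before upload with something like the following Python sketch. The helper `strip_accents` is hypothetical and not part of FreeCEN; it only handles letters that Unicode can decompose into a base letter plus an accent mark, so characters such as ø or ß would pass through unchanged.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their unaccented base letters."""
    # NFD splits e.g. 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Orléans"))
```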
All pretty complicated but I hope that explanation clarifies rather than confuses further!
One of the things that Kirk did was to change the definition of Broad Text to include the specific characters used in Welsh and Irish place names, and in some overseas place names such as French ones. The verbatim_birth_place field is defined as Broad Text. This is correct. The birth_place field, which holds the Alternative Place of Birth, is still defined as Narrow Text (I believe). The definition of the birth_place field needs to be changed to Broad Text so that it can accept the special characters.
@AnneV-Learn can you check that I am correct, please? If so, are you able to change the text type? As Broad Text accepts more characters than Narrow Text (including all of the Narrow Text characters), this should not cause any data integrity issues.
Geoff