FreeUKGen / FreeCENMigration

Issue tracking for project migrating FreeCEN to FreeCEN2 genealogy record database and search engine architecture. Code developed here is based on that developed in MyopicVicar
https://www.freecen.org.uk
Apache License 2.0

Enable search of Place of Birth from VLD files #1340

Open Captainkirkdawson opened 2 years ago

Captainkirkdawson commented 2 years ago

The Search record is populated with the POB and Alternate POB from either the VLD or the CSV entries.

The CSV entry is validated against the Gazetteer.

The VLD entry is not.

Captainkirkdawson commented 2 years ago

Ensuring the Gazetteer HAS been used is a much bigger task, as I noted earlier. There are 20 times as many different birthplaces as there are entries in the Gazetteer, i.e. there are over 900,000 birthplaces entered into records that are NOT in the Gazetteer. Pat asked 'if the process of incorporating a VLD file into the database could be changed in some way that would mean that the place name authority would reject files with unrecognised places'. Certainly it is feasible to add quality control to the VLD data ingest component. However, what about the 10,000 plus that would require revalidation of the birthplace field? Is the immense magnitude of this task something that we could consider? It took 20 years to enter and validate once; I cannot conceive of redoing the validation.

PatReynolds commented 2 years ago

Needs someone to understand how the issue might be resolved on the VLD files. @PatReynolds and @DeniseColbert to work on a spec for a volunteer - talk to @geoffj-FUG and @Captainkirkdawson to ensure we are asking for the right things.

geoffj-FUG commented 2 years ago

Pat

To my mind this is just another patch up. It is also impossible to achieve because of the limitations of the FC1 spreadsheet. That is why CSVProc was developed.

Vld files belonged to FC1. Csv files belong to FC2 and the capability to match with the Gazetteer was designed into FC2.

The capability to match FC1 files to the Gazetteer can never be achieved because of the limitations of the FC1 spreadsheet. The FC1 spreadsheet only allows a certain number of characters in the Place of Birth column. That width is not changeable because FCtools, used to create Valdrev FC1 files, will reject overlength entries.

In order to use a spreadsheet csv file in Valdrev it needs to be converted to a dat file. FCtools was not designed to deal with this function. That is one of the reasons that CSVProc was created to replace software such as FCtools.

These width restrictions mean that transcribers etc entered abbreviations whenever a Place Name would not fit into the field (there are plenty of them).

These abbreviations are not in the Gazetteer and never should be. As well as being a validation tool, transcribers etc use the Gazetteer as a reference to check that a place name is correct or to identify a correct alternative place name. It has proven very effective in increasing the accuracy of transcriptions. If abbreviations were allowed to be included in the Gazetteer, then the effectiveness of the Gazetteer immediately declines because the data has been bastardised.

If the vld file is tested against the Gazetteer during Incorporation then all abbreviations will be rejected. They cannot be corrected in Valdrev because they would be difficult to identify once the file is validated and because the change will be rejected by the software. In practice 99% of pieces would not pass the test.

In addition, any entries that are not yet in the Gazetteer will also be rejected because the validator has no way of knowing unless they check every record manually.
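The effect of a Gazetteer test at Incorporation, as described above, can be illustrated with a minimal sketch. It assumes an exact, case-insensitive match against a small stand-in gazetteer; the place list, function names and sample piece are hypothetical and are not taken from the FreeCEN code.

```python
# Hypothetical stand-in for the Gazetteer: a set of lower-cased place names.
GAZETTEER = {"kingston st mary", "kingston seymour", "totnes"}


def unrecognised_places(birthplaces, gazetteer=GAZETTEER):
    """Return the birthplaces that have no exact (case-insensitive) match."""
    return [p for p in birthplaces if p.strip().lower() not in gazetteer]


def piece_passes(birthplaces, gazetteer=GAZETTEER):
    """A piece passes only if every birthplace is recognised."""
    return not unrecognised_places(birthplaces, gazetteer)


# "Kingston St M" is an abbreviation forced by a narrow field;
# "Stockdale" is a place assumed not yet added to this stand-in gazetteer.
piece = ["Totnes", "Kingston St M", "Stockdale"]
rejected = unrecognised_places(piece)
print(rejected)
print(piece_passes(piece))
```

Both kinds of entry fail in the same way, which is why a strict test cannot distinguish an abbreviation from a genuine place that simply has not been added yet.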

The Placesup file that tests valdrev is not quality controlled and therefore is not a reliable test. It is also editable, so a Placesup file generated by a process such as a download from the Gazetteer is not reliable as a test. The validator can add to it (without quality control) and will need to make two changes to make it work – one to the Placesup file and the other to the Gazetteer. Catch 22 – the Placesup file for everyone else will be out of date and will need to be downloaded again.

There is only one practical way to ensure that the POB exists in an FC1 file at Incorporation and that is to convert it to a csv file using CSVProc and then validate it using CSVProc in FC2.

Geoff

PatReynolds commented 2 years ago

@PatReynolds and @DeniseColbert to ask the remaining coordinators for the placename files needed to complete this task.

geoffj-FUG commented 2 years ago

@PatReynolds @DeniseColbert - There is no point in getting the Placesup files from the Coordinators. Firstly they are not quality controlled. Secondly they are just a list of text entries with no additional information. A lot of the entries are not even valid. They were entered to make Valdrev work.

The Gazetteer is a resource that is alive - it grows as new places are added. There is no point in adding new places until they are needed. In time we will collect all of the place names from all over the UK and a great number from around the world, all with their latitudes and longitudes.

My point on 27 January is that this task cannot be done anyway. The valdrev software will not allow it to be done. CSVProc replaced Valdrev. Valdrev is a dead 25 year old DOS based programme. It is not being replaced. I believe that this story should be closed.

Geoff

PatReynolds commented 2 years ago

Enabling a place of birth search from VLD files does not change or contaminate the Gazetteer (as I understand what is being proposed here). What it means is: if a place of birth in a VLD file is 'Totnes', it will be found by a search for 'Totnes', alongside the 'Totnes' people in files that have been validated against the Gazetteer. If the place of birth in a VLD file is 'Tottnes' (and there is no 'Tottnes = Totnes' in the Gazetteer) then it won't be found. At the moment place of birth searching against VLD files cannot and does not happen.
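The behaviour described in this comment can be sketched as follows. This is an illustration only, assuming a simple exact match on the stored place of birth; the `SearchRecord` class, its fields and the sample people are hypothetical and do not reflect the actual search record model.

```python
from dataclasses import dataclass


@dataclass
class SearchRecord:
    name: str
    birth_place: str
    source: str  # "VLD" or "CSV" (hypothetical field)


# Made-up sample records for illustration.
RECORDS = [
    SearchRecord("Ann Smith", "Totnes", "CSV"),
    SearchRecord("John Hill", "Totnes", "VLD"),
    SearchRecord("Mary Cole", "Tottnes", "VLD"),
]


def search_by_birth_place(records, query):
    """Return records whose place of birth equals the query, ignoring case."""
    wanted = query.strip().lower()
    return [r for r in records if r.birth_place.strip().lower() == wanted]


hits = search_by_birth_place(RECORDS, "Totnes")
print([(r.name, r.source) for r in hits])
```

The search only reads the records: the VLD entry spelled 'Totnes' is returned alongside the CSV entry, the 'Tottnes' entry is not, and neither the files nor the Gazetteer are modified.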

PatReynolds commented 2 years ago

Part of Implementing Place of Birth Search.

@Captainkirkdawson to consider how to implement.

FreecenBren commented 2 years ago

I agree with Geoff. Besides that, there are 24 years of records, produced on at least 4 to 6 different types of software and at least 6 different types of PCs etc. Many hundreds of VLD files were added directly using the very old system before W32 etc., and after that by Dave Mayall and his successors, where the Validator was not recorded. Besides, it would be very time consuming even if it were, and that is time I do not have anyway.

Sorry but I agree with Geoff.

On Wed, 23 Mar 2022 at 4:47 pm, PatReynolds @.***> wrote:

Part of Implementing Place of Birth Search.

@Captainkirkdawson https://github.com/Captainkirkdawson to consider how to implement.


PatReynolds commented 2 years ago

Hello Brenda, the solution discussed yesterday addresses Geoff's concerns. We will use (in part) the code already developed to find place names (e.g. 'begins with "Stock"' or 'ends with "dale"') to identify places of birth in VLD files, and indicate that those records should be returned as part of the search results. No changes will be made to either the VLD files or the Gazetteer, only to the search function for place of birth.

Best wishes,

Pat
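A minimal sketch of the 'begins with' / 'ends with' matching mentioned in the comment above, applied to places of birth. It is not the existing place-name code: the function name, its parameters and the sample list are assumptions made for illustration.

```python
def match_birth_places(birth_places, begins_with="", ends_with=""):
    """Return places that start and end with the given text, ignoring case.

    An empty begins_with or ends_with places no restriction on that end.
    """
    prefix = begins_with.lower()
    suffix = ends_with.lower()
    return [
        place
        for place in birth_places
        if place.lower().startswith(prefix) and place.lower().endswith(suffix)
    ]


# Hypothetical places of birth as they might appear in VLD files.
vld_birth_places = ["Stockport", "Stockdale", "Rochdale", "Totnes"]

print(match_birth_places(vld_birth_places, begins_with="Stock"))
print(match_birth_places(vld_birth_places, ends_with="dale"))
```

The matching reads the stored text only, which is consistent with the point that no changes are made to the VLD files or the Gazetteer.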

PatReynolds commented 2 years ago

@geoffj-FUG could you make a start with a role description - what do they need to do? What experience/knowledge will be needed?

geoffj-FUG commented 2 years ago

Pat

We need to decide how to do it first. Once the vld dataset is stable it can be isolated. That means that we can work on it. If it is not stable we will only have to redo the exercise when the next vld file is uploaded, and the next one, and the next one and so on. So, we cannot even start until no more vld files are being uploaded.

Then we need a utility app to do it with. The app can be built bit by bit to do just one thing at a time.

Firstly it needs to identify every record that does not have one of the Place of Birth and the Validator's choice in the Gazetteer. (If one tests true against the Gazetteer the record is accepted.)

Then it needs 'replace all' functions similar to a spreadsheet. A spreadsheet will not do it as the data set is too big. This would be used to rearrange entries in accordance with the greater to smaller rule. This would also be used to expand entries that have been abbreviated.

Retest the vld dataset and there will be a reduced number of entries that need looking at. These should be validated with a routine similar to the validation routine. Once an alternative has been entered it should be promulgated with a choice - global or piece only. (For instance Kingston in SOM could be Kingston St Mary or Kingston Seymour depending on where the piece is located. So the option of piece only means that the alternative can be geographically restricted.) Once promulgated, the rest of the entries should not be available for re-validation because the 'one of' match now exists.

So, we need several things to happen prior to a duty statement - in order:

1. End of vld uploads
2. Development of testing tool against the Gazetteer
3. Development of the replace all functionality for the vld dataset

Only when that has been completed should we move on to actual validation.

Geoff
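The testing and promulgation steps outlined in this comment could look roughly like the sketch below. It is a sketch under stated assumptions, not a design for the utility app: the `VldEntry` class, the piece identifiers and the function names are hypothetical, and the gazetteer is a small stand-in set.

```python
from dataclasses import dataclass


@dataclass
class VldEntry:
    piece: str             # hypothetical piece identifier
    birth_place: str       # Place of Birth as transcribed
    alternative: str = ""  # Validator's choice, empty if none


def in_gazetteer(place, gazetteer):
    return place.strip().lower() in gazetteer


def needs_review(entries, gazetteer):
    """Entries where neither the Place of Birth nor the alternative matches."""
    return [
        e for e in entries
        if not (in_gazetteer(e.birth_place, gazetteer)
                or in_gazetteer(e.alternative, gazetteer))
    ]


def promulgate(entries, birth_place, alternative, piece=None):
    """Apply an alternative to every matching entry.

    piece=None means global; otherwise only that piece is changed.
    Returns the number of entries updated.
    """
    changed = 0
    for e in entries:
        if e.birth_place == birth_place and (piece is None or e.piece == piece):
            e.alternative = alternative
            changed += 1
    return changed


gazetteer = {"kingston st mary", "kingston seymour", "totnes"}
entries = [
    VldEntry("piece_A", "Kingston"),
    VldEntry("piece_B", "Kingston"),
    VldEntry("piece_A", "Totnes"),
]

before = len(needs_review(entries, gazetteer))
promulgate(entries, "Kingston", "Kingston St Mary", piece="piece_A")
promulgate(entries, "Kingston", "Kingston Seymour", piece="piece_B")
after = len(needs_review(entries, gazetteer))
print(before, after)
```

The `piece` argument mirrors the 'global or piece only' choice: the same transcribed 'Kingston' receives a different alternative in each piece, and once either field matches the entry drops out of the review list.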