Closed geoffj-FUG closed 1 year ago
On investigation I received the following information from Vino:
Query
The specific 1841 file I had trouble with is HO107_954 (copy attached). The problem is evident in any HO107 1841 download of a vld file. My query is about the structure of the vld dataset. The Valdrev upload is a fixed-length flat file. The first character in the upload is a code that changed as the record progressed through the DOS-based software. This was closely followed (I am not sure if it was immediately) by a number which was the number of the family (i.e. it changed as the schedule changed) and a smaller number which was the position of the person in the family (1, 2, 3 etc.). These two fields, to my mind, formed part of the key to the file. My query was whether this key still exists. If so, it is probably the solution to our problem of consecutive schedule numbers of 0 in 1841.
Reply
Apologies Geoff, unfortunately the key does not exist.
Query
That is serious. I can therefore see no way of breaking up the vld files into schedules, as all have a schedule number of 0. We have been fixing up some of the other years' downloads manually when there were repetitive 0s in the schedule numbers. They were all marked with an unoccupied building code so could be identified. We can’t do that with 1841. Everything is schedule 0.
When you have the opportunity can you have a look at the original FreeCEN1 dataset for me and see if that key is there, please?
I am hoping that you will see something like this: C 0000010001Walcot 1 4 2 1 1 Vineyards -JAQUES Susanna -Wife MF64 y-Housekeeper -GLSWesterleigh -
In this sample upload record the C is a processing status code. It looks as if the first two digits (00) are not relevant to this problem. Then 0001 is the family number and 0001 is the position of the person in the family. Then Walcot is the Civil Parish, 1 is the ED number, 4 is the folio, 2 is the page, 1 is the schedule number, 1 is the house number, Vineyards is the Address, - is no flag, JAQUES is the surname, Susanna is the forename, - is no flag, Wife is the relationship, M is the marital status, F is sex, 64 is age, y is years, - is no flag, Housekeeper is occupation, - is no flag, GLS is the Chapman code, Westerleigh is the place of birth and - is Notes. Some of the spaces are empty fields. Everything in the upload was fixed length. (No flag means that an x query flag was not entered.)
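To make the key layout concrete, here is a minimal Python sketch that pulls the status code, family number and position out of the start of a record. It is an illustration only and not FreeCEN code: the names `VldKey` and `parse_vld_key` are hypothetical, and the layout (one status character, optional padding, two digits that are skipped, then two four-digit numbers) is assumed from the sample record. The thread later quotes a field map describing a six-digit household counter, so the two skipped digits may in fact be the high-order digits of the household number.

```python
import re
from typing import NamedTuple, Optional


class VldKey(NamedTuple):
    """Hypothetical container for the leading key fields of a vld record."""
    status: str      # processing status code, e.g. "C"
    household: int   # family (household) number
    position: int    # position of the person within the family


# Assumed layout: status character, optional padding, two digits that are
# ignored here, a four-digit family number, a four-digit position.
_KEY_PATTERN = re.compile(r"^(\S)\s*\d{2}(\d{4})(\d{4})")


def parse_vld_key(record: str) -> Optional[VldKey]:
    """Return the key fields, or None if the record does not fit the layout."""
    match = _KEY_PATTERN.match(record)
    if match is None:
        return None
    status, household, position = match.groups()
    return VldKey(status, int(household), int(position))


sample = ("C 0000010001Walcot 1 4 2 1 1 Vineyards -JAQUES Susanna "
          "-Wife MF64 y-Housekeeper -GLSWesterleigh -")
print(parse_vld_key(sample))
```

A regular expression is used rather than fixed column offsets because the exact padding of the original fixed-length record is not visible in the sample as quoted.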
I would assume that the piece number is included somewhere in the FreeCEN1 record as well.
If the 2 key fields are in the original FC1 database record, we have a chance of retrieving this situation.
Now I understand why Kirk was having so much trouble with the schedule 0s. It appears that the repetitive 0 schedule numbers may not have been considered in the design of the FC2 database.
Reply
I now have an Action Report Reference 236560824 which indicates that our families are being broken up incorrectly when we display 1841 files to the researcher.
I have a strong feeling that the incidence of repetitive schedule numbers of 0 was not identified when the FreeCEN2 dataset was designed and it appears that the incidence of a schedule number different to the previous one may have been used as a trigger for a new family.
If that is the case we may need to add these fields to the vld dataset and repopulate 1841 entries from the legacy FreeCEN1 dataset. Then the way that we handle the reports and downloads will also need to be revised.
This is a major change and the problem is affecting all of our data as the repetitive 0 problem also raises its head in other years.
It may need some extra work to confirm my findings and a dedication of resource to fix it.
Geoff
I am now over 95% certain that there is a fundamental problem in the design of the FC2 vld dataset. It appears that 2 fields that form a key and were in the FC1 dataset are missing from the FC2 dataset. The important one of these fields is the family number. When that number changed it indicated a new family. In 1841 all schedules were 0 and therefore this number is essential to break up the piece into families.
The problems that have arisen as a result include:
• 1841 vld pieces cannot be broken into schedules;
• Repetitive unoccupied houses generate errors in the test report when the piece is downloaded because the change in house is not recognised;
• I have an action report that shows that families are being split when an 1841 vld piece is being researched;
• I suspect that when an enumerator has used the same schedule number twice for consecutive families, the 2 families are being combined.
There were other problems that Kirk has fixed using workarounds. It used to cause him no end of headaches.
I do not know what sort of data is being uploaded from the current vld uploads. The data in these fields is evidently being stripped.
The solution as I see it is:
• To restore the family number to the vld dataset.
• It will then need to be populated from FC1.
• The legacy vld upload will need to be amended to allow for this additional field.
• The download code will need to be amended to enable the schedules to be split up properly.
• The screen reports for researchers will need to be amended to make sure that families are not displayed incorrectly.
That is my analysis. It will need to be confirmed.
Question: How is the schedule number change for 1841 handled for csv files in FC2?
Geoff
@geoffj-FUG @Vino-S this crossed my inbox and it piqued my interest, as I am still alive and kicking. Some observations and a question come up when reading your posts.
Firstly, the Key you ask about is indeed ingested into FreeCEN2 as part of the vld entry for a vld file. How, when and where it is used are other questions to which I will not respond at this time.
Secondly, the Key is not used in the downloading of a VLD file in either CSV or CSVProc formats, and remember that only CSVProc downloads are available to coordinators. The CSV download is a system administrator option only.
Thirdly, you are correct that failure to use that information is likely the source of many issues. Certainly I was unaware of its existence and how it might be used.
Now the question. You have mentioned a file HO107954 that contains an entry C 0000010001Walcot 1 4 2 1 1 Vineyards -JAQUES Susanna -Wife MF64 y-Housekeeper -GLSWesterleigh -
I cannot see that entry in that file in either the original FC1 file or the FC2 file of that name, nor is Susanna JAQUES retrievable in an FC1 or FC2 search for 1841. She is retrievable for 1851 and 1861, and the entry you refer to is there, but for 1861 and in file RG091690.
Kirk
I am glad to hear that you are still out and about and taking an interest in FreeUKGen.
In answer to your question. The record for Susanna is in the only Valdrev piece that I now have access to, my old Laptop having crashed. It would have been uploaded to FC1. The entry is in the vld dataset.
I have attached the downloaded file (cc’d to Kirk) for your information. It is not Susanna's piece. As you can see we have a problem in that every record is 0. I have not deleted the file; I have only downloaded it as CSVProc.
I am simply going back to the original FC1 Valdrev flat file logic and investigating the family entry that used to break up the families in FC1 and the retention of that information. The part of the overall key that interested me is the family number and position in family (1,2,3 etc).
In the vld file as uploaded the family number is 4 digits starting at 1 and incrementing through the piece. If a manual edit of the valdrev file was needed, a coordinator simply added a new family number where it was needed. I used to start at 4000 to make sure it was clear of the maximum number in the file. It did not need to be sequential. From the record that Vino sent me it looks as if that number is updated in the FC1 database to increment throughout the dataset. So it looks as if it is unique for each family in the whole dataset. (I had previously theorised to myself that it might have been combined with the piece number as part of a compound key).
It is good to know that the key has been ingested into FreeCEN2. It eliminates any need to update the datasets. I am about to test the downloads from the csv dataset. I have the numbers of 3 x 1841 pieces that have been recently uploaded as csv files. It will be interesting to see whether the same problem rears its head when I download them.
I firmly believe that with a rewrite of the code to use that family number key we can resolve all the schedule 0 issues. It would also be compatible with the existing 1851-1911 files that appear to start a new family when the schedule number changes (Kirk, can you confirm this theory please?).
I still need to have my thoughts double checked by someone else in case I am on the wrong track.
Thanks for your advice. Please, if you spot anything that will help let us know.
Geoff
I have downloaded 3 x 1841 csv files, each in a different County with a different Coordinator, and they all seem to have the schedule numbers on the correct line. The issue therefore seems to be limited to the vld dataset.
There are instances where the schedule number and address are the same in following families and the schedule number is reported correctly. From that it would seem that the logic in processing the vld dataset and in processing the csv dataset is different.
I do not have the ability to extract the data from the csv dataset so cannot confirm that the Incorporated data is behaving correctly. My access is to the Coordinator's copy of the Incorporated piece.
Geoff
@geoffj-FUG @Vino-S Thank you.
Thank you Kirk, that is great.
The field in the map that I have been referring to is
257 # C 5 6 A six digit number (leading zeros) which counts the households
Kirk has given us a solution to our immediate problem.
We will need to do the rewrite in order to fix this issue properly. It should then resolve the other schedule 0 issues that we have experienced.
It looks to me (I am not a programmer, I am an analyst) as if the code here:
We should be able to add the field to the field list on rows 7-29 of the code and the code above can then be replaced by a far simpler if family number <> previous family number then use the schedule number. As Kirk says this field is not in the field list on lines 7-29.
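The rule suggested here can be sketched in a few lines of Python. This is only an illustration of the logic, not the FreeCEN2 download code: the function name `schedule_column` and the (family number, schedule number) row shape are assumptions. The comparison is "different from the previous row" rather than "previous plus one", because the thread notes that family numbers need not be sequential (for example a manually inserted 4000).

```python
from typing import Iterable, Iterator, Optional, Tuple


def schedule_column(rows: Iterable[Tuple[int, str]]) -> Iterator[str]:
    """Yield the schedule number on the first row of each family, else ''.

    Each input row is (family_number, schedule_number); both names are
    hypothetical stand-ins for the stored vld fields.
    """
    previous_family: Optional[int] = None
    for family_number, schedule_number in rows:
        if family_number != previous_family:
            # The family number changed, so this row starts a new family.
            yield schedule_number
        else:
            yield ""
        previous_family = family_number


# Invented rows: every schedule number is "0", as in an 1841 piece.
rows = [(1, "0"), (1, "0"), (2, "0"), (4000, "0"), (4000, "0"), (3, "0")]
print(list(schedule_column(rows)))  # ['0', '', '0', '0', '', '0']
```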
Kirk
My query is (sanity checking) – if we change lines 526 – 527 so that line = “” instead of line = “0” will the schedule number 0 in lines 532-533 appear OK as it is repetitive? It looks OK to me but I last programmed in the late 1980s. If so, I am willing to test this as a temporary fix. Will it also resolve the schedule number in the search issue, or will that still exist?
Geoff
@geoffj-FUG The quick fix changes line 527 to line = @blank. The rest of the code applies to non-1841 files and remains the same.
@geoffj-FUG The quick fix has been applied to test3 you should test.
@geoffj-FUG Great care would have to be taken with any revised coding that tried to make use of the dwelling number and/or the sequence number, as those numbers are not necessarily in order and are subject to large increments.
Kirk
That removes the 0 OK. But we now have all null entries instead of all 0 entries.
I notice that if nothing was in the address field a – was entered so not null will pick up that hyphen.
We need something like:
If year = 1841 and house_or_street_name is not null
Set schedule_number to 0
Endif
I can personally populate the column in the spreadsheet by using an IF and then copying and pasting the results, but some Coordinators may not be happy with that.
The above suggestion should resolve that problem.
The 1841 file will then be usable.
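As a rough Python rendering of the If/Endif suggestion above (an illustration only; `quick_fix_schedule` and its parameters are hypothetical names, not the actual download code):

```python
def quick_fix_schedule(year: int, house_or_street_name: str,
                       schedule_number: str) -> str:
    """Return "0" for an 1841 row whose address field is populated.

    A lone hyphen counts as populated, matching the remark above that
    "not null" will pick up the hyphen entered for an empty address.
    Rows that do not meet the condition keep their existing value.
    """
    if year == 1841 and house_or_street_name.strip() != "":
        return "0"
    return schedule_number


print(quick_fix_schedule(1841, "Coombe", ""))  # row with an address
print(quick_fix_schedule(1841, "", ""))        # row with no address
```

Whether a hyphen-only address should receive a schedule number of 0 is a policy question for the Coordinators; the sketch simply follows the wording of the suggestion.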
Geoff
@geoffj-FUG Please test again; is '0' correct against uninhabited?
Kirk, that looks good. Let's go with it! :) Geoff
@geoffj-FUG The quick fix has been deployed to production. I will leave it to the scrum to assign the long term rewrite of the CSVProc download of a VLD file. Have a good winter.
Many thanks for the quick-fix @Captainkirkdawson!
Longer term fix still needed
@geoffj-FUG to source a block of VLDs and upload to test3. @AnneV-Learn to set aside a substantial block of time in September/onwards to work on this issue only.
I have now uploaded a range of files for different years and different sizes into test3 (user id somt.cen, County SOM) for this exercise. They are:
1841 - HO107_929, 930, 940, 945
1851 - HO51_923, 924, 925
1871 - RG10_350, 351, 352
1881 - RG11_2359, 2360, 2361
Geoff
I have now downloaded each of these VLD files as a CSVProc file. I have added _a to the end of each file name so that the current results are known, and reloaded them to user id somt.cen on test3. We now have a baseline against which to compare any work that is done on this problem.
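One way a baseline like this could be compared against a later download is a row-by-row diff of the two CSV files. The following Python sketch is an assumption about how that might be done, not a description of any existing FreeCEN tool; the function name `differing_rows` and the two-column sample data are invented for illustration.

```python
import csv
import io
from typing import List, Tuple


def differing_rows(baseline_csv: str,
                   candidate_csv: str) -> List[Tuple[int, List[str], List[str]]]:
    """Return (row number, baseline row, candidate row) for each mismatch.

    Row numbers are 1-based and include the header row. A file that is
    shorter than the other is padded with empty rows for the comparison.
    """
    baseline = list(csv.reader(io.StringIO(baseline_csv)))
    candidate = list(csv.reader(io.StringIO(candidate_csv)))
    differences = []
    for index in range(max(len(baseline), len(candidate))):
        old = baseline[index] if index < len(baseline) else []
        new = candidate[index] if index < len(candidate) else []
        if old != new:
            differences.append((index + 1, old, new))
    return differences


# Invented two-column data: the baseline repeats schedule 0 on every row.
baseline_text = "schedule,surname\n0,A\n0,A\n0,B\n"
candidate_text = "schedule,surname\n0,A\n,A\n0,B\n"
print(differing_rows(baseline_text, candidate_text))
```

To compare real files, the text of each download would be read from disk and passed in as the two arguments.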
Geoff
@geoffj-FUG Fix has been deployed to Test3 if you could please test.
Anne Before I test this can you please advise what changes have been made. I believe that this issue affects the researchers reports as well as the vld downloads. I just want to be sure I am testing the right parts of the system. Geoff
Hi Geoff, Oh I only made changes to the VLD download area of code. I hadn’t appreciated that there was a problem with the researchers reports. Can you please clarify what the issue is there and what is required? Sorry, Anne
Anne
This is complex.
I will write you an explanation of the root causes. The problem goes right back to the development of FC2 and a misunderstanding of the keys in FC1. This caused a series of issues which were all addressed by fix-ups. The underlying cause was not identified until recently and so we need to go back and apply the correct logic from the start.
I am back working with FreeCEN again now (I had another priority the last few weeks). I will write this up for you over the next few days.
Geoff
Thanks @geoffj-FUG. If you have examples of data where the issue is clearly visible in researchers reports that would be very helpful. Also is the researchers reports issue applicable to VLD data loaded into FC2 (via the application menu option), data loaded from CSVProc files generated from data originally loaded from VLD batch upload, FC2 CSV file data loaded via the FC2 application or all three?
Anne
FreeCEN2 has a problem in that it cannot download or report on repetitive schedule numbers of 0. This is a serious problem in 1841 as all schedule numbers are 0. Coordinators have also been manually fixing up Downloads of other years as the problem creates errors and therefore the problem records can be identified.
It appears that the coding in FreeCEN2 looks for a change in schedule number to identify when a household changes. Whilst this approach works in most cases it fails completely for 1841 pieces. A number of ad hoc fixes have been applied to get around the problem. They are not a complete solution.
The issue only occurs where the record is in the vld data set. The problem does not occur in the csv data set. It is therefore a FreeCEN1 to FreeCEN2 conversion issue. It is my belief that the problem exists simply because the FreeCEN1 keys have not been recognised in the design of FreeCEN2.
A notional key for a FreeCEN1 vld record when it is uploaded is a compound key incorporating three fields: piece number, household number and position in household. The record itself is a fixed-length record in a flat file.
A typical vld record looks like:
C 0000090003Withycombe 1 4 2 0 Coombe -VICKERY Louisa -ServntUF12 y-General Servant (Domestic) -SOMWithycombe -
Where
• C is a processing code used by the FC1 validation software (Valdrev). It has no use in FC2 and can be ignored.
• 00 are FC1 codes and can be ignored for this exercise.
• 0009 is the household number. This increments every time that a household changes. This is the household number in the notional key above.
• 0003 is the position of each person in the household. It increments for each record and resets to 1 when the household number changes.
• Withycombe is the Civil Parish.
• 1, 4 and 2 are the Folio, Page, and Schedule number.
• 0 is the street number
• Coombe is the address.
• The balance of the information is the information about an individual.
In my notional key above the piece number is recorded within the vld data set but is not uploaded with the records. It is presumably derived from the vld file name.
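The notional key can be pictured as a three-part tuple. The Python sketch below is illustrative only: `NotionalKey`, `piece_from_filename` and the file name `HO107_929.VLD` are hypothetical, and deriving the piece from the file name follows the presumption stated above rather than anything confirmed about FC1 or FC2.

```python
from pathlib import PurePath
from typing import NamedTuple


class NotionalKey(NamedTuple):
    """Hypothetical compound key: piece, household number, position."""
    piece: str
    household: int
    position: int


def piece_from_filename(filename: str) -> str:
    """Assume the piece is the file name without its extension."""
    return PurePath(filename).stem.upper()


def notional_key(filename: str, household: int, position: int) -> NotionalKey:
    return NotionalKey(piece_from_filename(filename), household, position)


keys = [
    notional_key("HO107_929.VLD", 9, 3),
    notional_key("HO107_929.VLD", 9, 1),
    notional_key("HO107_929.VLD", 2, 1),
]
# Tuples sort field by field, so sorting restores piece/household/person order.
print(sorted(keys))
```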
Our problem is the recognition of the change of household. Using the change of schedule number to flag the change is not working. We need to rework the code so that the change of household is identified by the change in household number.
I have previously been advised by Kirk that the household number is stored in the vld data set. It can therefore be used to identify the start and end of each household. This applies to both the vld conversion download and to the data extracted to respond to a request by a researcher.
By utilising this household number the schedule number of 0 for 1841 households can be inserted in the correct place in a download and for other years the problem of repetitive schedule numbers can also be addressed.
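The difference between the two triggers can be shown with a small Python sketch. It is an illustration under stated assumptions, not the FreeCEN2 implementation: the records are invented, and the field names `household` and `schedule` are hypothetical stand-ins for the stored vld fields.

```python
from itertools import groupby
from typing import Dict, List

Record = Dict[str, object]


def split_households(records: List[Record], key_field: str) -> List[List[Record]]:
    """Start a new group each time the value of key_field changes."""
    return [list(group)
            for _, group in groupby(records, key=lambda r: r[key_field])]


# Invented 1841-style records: three households, every schedule number 0.
records = [
    {"household": 1, "schedule": 0, "forename": "A"},
    {"household": 1, "schedule": 0, "forename": "B"},
    {"household": 2, "schedule": 0, "forename": "C"},
    {"household": 3, "schedule": 0, "forename": "D"},
]

by_schedule = split_households(records, "schedule")
by_household = split_households(records, "household")
print(len(by_schedule), len(by_household))  # 1 3
```

Grouping on the schedule number merges the whole piece into a single household, while grouping on the household number recovers all three. `itertools.groupby` only groups consecutive equal values, which suits this case because a new household is signalled by a change from the previous record.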
Geoff
Anne
You ask: "Also is the researchers reports issue applicable to VLD data loaded into FC2 (via the application menu option), data loaded from CSVProc files generated from data originally loaded from VLD batch upload, FC2 CSV file data loaded via the FC2 application or all three?"
The problem is identified by selecting a piece in the vld set and transferring it to the user folder. The piece is then converted to a csv and downloaded. The downloaded file contains the erroneous records. The problem is also identified by researchers' action reports, which show households reported with too many members and households broken up by a page number change. I also suspect that story #748 (FC1 schedule 0 breaks FC2) is a symptom of the problem.
Geoff
Thanks @geoffj-FUG. So the code modifications that I have made so far (and are deployed to Test3) do follow your specification (i.e. schedule number triggered by change of household number) for the CSVProc file download of a VLD file. Which means (I believe) that when a CSVProc file generated in that way is incorporated (this is when ‘Search records’ are created, and these are what the FC2 researcher searches use as their data source) the issue should be resolved.
What I haven’t looked at thus far is the ‘upload new VLD file’ process (via Manage Counties -> Manage VLD files) and the monthly VLD upload process. Do you believe there is a similar problem with Search records (i.e. the data that researcher searches use as their data source) created from those processes? If so I’ll look at those. That being the case, is there a need to retrospectively check programmatically (and correct where necessary) all Search records created from VLD files (this could be quite a computing-resource-heavy task), or will any existing data issues be corrected individually (by downloading the VLD file as a CSVProc file and loading that after deleting the VLD file) as and when they are identified, assuming the number of such issues is estimated to be quite low? Hope that makes sense!
Anne
The csv downloads are now downloading perfectly. We can migrate this to production. I still have an inkling that we had the same problem in the researchers' results. I have not been across reporting so cannot be certain. However, I have tested this using 1841 data and have got the correct results. (I used a family surname that appeared in several consecutive families and the results broke up the families OK.) I therefore believe that we may have cleared this issue as well. I have no sound information on story #748. I think that may have been the first instance of this problem. Once migrated, this story can be closed. Geoff
Anne
Sorry, I have just found another glitch. The system is adding a schedule number of 0 against every page number change on a vld download converted to csv. This is the issue that I had reported some time ago that caused me to suspect the researcher reporting. Is there a 'fix' still hanging around that is causing this? Geoff
Ok @geoffj-FUG I’ll take a look when I am back at my desk (that won’t be until next Wednesday as I am away at the moment). Are you able to identify a piece that illustrates the problem (as that would help enormously when trying to solve it)?
Anne
I have attached HO107_929_z to your copy of this email. Have a look at the Kitch family. I examined these to test as the surname does not change. At the start of each new page the schedule number of 0 is entered and the previous address is repeated.
Geoff
👍🏻thank you
Anne
I am now completely miffed. I have downloaded pieces from CON, GLS, KEN and SOM. They all convert to CSVProc properly except SOM. So, I tried different piece numbers. The early SOM pieces had the problem. The later ones didn’t.
I have considered that the issue may have been created by an early version of Valdrev and FCTools. My later SOM pieces would have been validated on a newer computer and so would have been set up with a clean download of the software. But we would have become aware of the issue years ago. The earliest SOM files were 1861 but they do not have the problem. 1861 has schedule numbers so the 0 schedules would not have been there.
I have come to the conclusion that the problem is within my older 1841 SOM files. I will need to convert and clean them in due course. The simplest answer is that the transcriber put the extra 0 into the transcription and that the proofreader did not remove them. We just happen to have selected the problem pieces at random as our test pieces.
Geoff
This is also covered by story #748. Given the findings above I am now more than happy with Anne's updated code. The updated code can be deployed and both stories closed. Geoff
@geoffj-FUG to check before closing
I have downloaded an 1841 piece and the schedule numbers are now appearing correctly. I will close this story. Geoff
The problem process is a download of an 1841 VLD file. (I have sent a copy of the download to Anne in a separate email.) As you will see it has a schedule number of 0 on every row, instead of at the beginning of each schedule. This is caused by the restructuring of FreeCEN1 to FreeCEN2 several years ago and the loss of the FC1 key (or non-use of the FC1 key?). It has caused problems for some time: shipping, workhouses, unoccupied buildings etc. It is now showing up once again in the CSVProc download of an 1841 VLD file. 1841 is different in that every schedule is schedule 0. This particular download probably needs re-coding. (The other years are OK because they use schedule numbers.) If we can see the old vld key in the file that should help. We seem to have lost control of these repetitive schedule 0s once we moved away from vld files. However, if the key is still in the structure we have a chance of resolving this problem. Geoff