FreeUKGen / FreeCENMigration

Issue tracking for project migrating FreeCEN to FreeCEN2 genealogy record database and search engine architecture. Code developed here is based on that developed in MyopicVicar
https://www.freecen.org.uk
Apache License 2.0
4 stars 3 forks source link

Check integrity of Site Stats #1428

Closed DeniseColbert closed 2 years ago

DeniseColbert commented 2 years ago

Check download of VLD files and subesquent uploads of CSV files after correction are not included in the site statistics.

AnneV-Learn commented 2 years ago

@DeniseColbert - I have checked the code and as far as I can see ALL CSV file uploads and incorporations are counted in Site Statistics. There is no check to see if a CSV file is 'replacing' a VLD file. It is not clear to me that the current db design of Site Stats would easily be able to detect that a CSV upload is replacing a VLD file - it would mean looking at data at the Piece level which it does not do currently, it accumulates stats at VLD/CSV file level. What are the steps that are actually used to ' download a VLD file and subesquently upload as a CSV file after correction'?

DeniseColbert commented 2 years ago

Hi @AnneV-Learn, thanks for looking into this. I'm not sure if Brenda has answered your question (in the email sent offline):

What are the steps that are actually used to ' download a VLD file and subesquently upload as a CSV file after correction'?

But if not, this video might help? https://youtu.be/XH-QBKRg9gQ

(full set of videos is here https://www.youtube.com/playlist?list=PLR9WizgFnUWzpzO0GzeT-fRbLYRfBtIls in case the info you need might be in another of them)

AnneV-Learn commented 2 years ago

@DeniseColbert, @geoffj-FUG - so if a VLD file is being replaced by a CSVProc file, is the VLD file deleted and then the CSV file uploaded as if the VLD had never existed?

geoffj-FUG commented 2 years ago

Anne Yes the vld file is deleted. The record count is usually 1:1 except if a record that was missed has been added. However the file name may have been amended where the PARMS has the new structure involving A B etc after the number. The process is

DeniseColbert commented 2 years ago

@geoffj-FUG how frequently are CSVs replaced/reloaded? It would be a large amount of work, and so if it's not massively misrepresenting our additions, I imagine it could be low priority.

geoffj-FUG commented 2 years ago

I do about one a month for each County. One inn May was over 20,000 records so the variation can be significant. The problem is masked by the fact that 2 Coordinators are still using vld files and uploading them. If the system was static we would see the change in totals and just be able to use that adjustment. The problem is a result of the continuation of vld uploads. How are we measuring the additional files uploaded by those Coordinators? These deletions could have an impact on our stats. The question is - Are the Exec and Trustees happy to accept the stats as at present? Geoff

PatReynolds commented 2 years ago

Developers need to build a count of VLDs in the database, and a change on the last month.

DeniseColbert commented 2 years ago

Geoff suggested a SUM of CSV and VLD records to be added to Site Stats table and CSV download report (extra column: New Records Added)

AnneV-Learn commented 2 years ago

@geoffj-FUG @DeniseColbert I have started looking at this in more depth and have the following observations and a question.

Currently Search records shown in Site Statistics is calculated as follows: Count of Search records with an system assigned creation date (actually part of the unique Identifier of the record) up to midnight on previous day. I am thinking that quite possibly it could be used to indicate what has been added to the database (New Records) in the last ‘month’ (by comparing counts for the day and 30 days earlier).

If we went down the originally suggested route of counting entries in VLD and CSV files (and comparing date and 30 days earlier) to arrive at ‘New Records’ wouldn’t we have to look at the ‘online’ incorporation date date for the VLD/CSV file otherwise we will be including records that are not yet available to researchers. Using the ‘online’ date does not get over the existing issue where re-loaded files will be counted as new records.

A question however we arrive at the ‘New Records’ figure, will it replace the ‘Individuals Report’ or be an additional on (New Records Report)?

Screenshot 2022-07-16 at 14.42.29.png

geoffj-FUG commented 2 years ago

Anne, Denise

As I see it anything uploaded to the csv dataset can be counted against the total count for that month.

The problem comes with the vld dataset. It has two occurrences affecting our count –

New records being added by either of the two Coordinators still using the FC1 FCTools and Valdrev.

Records removed from the dataset so that an updated file can be uploaded to the csv dataset. The two files may or may not have the same record count.

The contents of the csv monthly report reflects work done on csv files in the past month because vld pieces deleted and replaced by csv pieces have had to be re-validated.

Thinking aloud we know what was uploaded to the csv dataset based on the date and we know what was uploaded to the vld dataset presumably based on a similar date. (That’s how I interpreted Anne’s post. Please correct me if I am wrong). These are all Incorporated records and so are available for searching. Our problem appears to be that we do not have a way of tracking what was removed from the vld dataset.

Still thinking aloud: (vld records total ) - (vld records added + last months vld records) = vld records removed.

So, we can calculate the number of vld records deleted.

I suggested that if we add the csv total records to the vld total records and subtracted last months total we could get a more accurate total of records added. My initial suggestion was that it should happen at the grand total level.

However, given my thoughts above we can actually calculate the missing figure.

Given the letter that Pat has prepared for the Trustees (Denise, I presume that you are across it) maybe we should start at the grand total level and only drill down to County level when we see the results of that exercise. Then we can decide what to do from that point and even reconsider if there is a benefit of drilling down to County level.

Geoff

From: Anne Vandervord @.> Sent: Sunday, 17 July 2022 12:00 AM To: FreeUKGen/FreeCENMigration @.> Cc: geoffj-FUG @.>; Mention @.> Subject: Re: [FreeUKGen/FreeCENMigration] Check integrity of Site Stats (Issue #1428)

@geoffj-FUG https://github.com/geoffj-FUG @DeniseColbert https://github.com/DeniseColbert I have started looking at this in more depth and have the following observations and a question.

Currently Search records shown in Site Statistics is calculated as follows: Count of Search records with an system assigned creation date (actually part of the unique Identifier of the record) up to midnight on previous day. I am thinking that quite possibly it could be used to indicate what has been added to the database (New Records) in the last ‘month’ (by comparing counts for the day and 30 days earlier).

If we went down the originally suggested route of counting entries in VLD and CSV files (and comparing date and 30 days earlier) to arrive at ‘New Records’ wouldn’t we have to look at the ‘online’ incorporation date date for the VLD/CSV file otherwise we will be including records that are not yet available to researchers. Using the ‘online’ date does not get over the existing issue where re-loaded files will be counted as new records.

A question however we arrive at the ‘New Records’ figure, will it replace the ‘Individuals Report’ or be an additional on (New Records Report)?

— Reply to this email directly, view it on GitHub https://github.com/FreeUKGen/FreeCENMigration/issues/1428#issuecomment-1186191716 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AKCPIFKGF6IJ4LPNWLYVRT3VUK54RANCNFSM5XQI76TA . You are receiving this because you were mentioned. https://github.com/notifications/beacon/AKCPIFLH3SJXSRVF6HGPLFDVUK54RA5CNFSM5XQI76TKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOI2Z5SZA.gif Message ID: @. @.> >

AnneV-Learn commented 2 years ago

@geoffj-FUG Sorry but I would just like to check that I am understanding your logic.

I have tried to make the calculation for Total Records (CSV + VLD) for 20th July based on my understanding of your formula using the numbers reported on Site Stats page.

A. Total VLD records @ 20th July = 42,241,109 B. VLD records added (New VLD records column) = 62,567 C. Total VLD records @ 20th June = 42,257,640 D. Total CSV records @ 20th July = 6,347,404

E. VLD records removed = A - (B+C)

= 42,241,109 - (42,257,640 + 62,567)

= -79,098

F. Total Records @ 20th July = A + D - E

= 42,241,109 + 6,347,404 - 79,098 (ignore the sign on E ?) = 48,509,415

NOTE: Search records @20th July reported on Site Stats as 44,967,143

Does that look correct?

geoffj-FUG commented 2 years ago

Anne

Total VLD records at 20 June = C = 42,257,640

Total records at 20 July = total VLD at 20 July + total CSV at 20 July = A + D = 42,241,109 + 6,347,404 = 48,588,513

VLD at start = C

VLD added = B

Potential VLD records at end = B + C = 42,320,207

Actual VLD records at end = A = 42,241,109

VLD records removed = (B + C) – A = 42,320,207 – 42,241,109 = 79,908 (I rearranged the formula to arrive at a positive number).

Your F calculation appears to me to have deducted E twice. E will already have disappeared from the total VLD count because the records have been removed so should not be deducted. (Yell at me if my logic is wrong).

The number of VLD records removed is quite feasible. I have removed about 51,350 records from the vld dataset and reloaded them as csv myself during that time period (an unusual number but I had a backlog to address that included a couple of big files).

We appear to have a problem – The difference between the official stat at 20 July and my calculation is 3,621,370. (I.e. 44,967,143 - 48,588,513 from above). Before we do anything else, we need to investigate that discrepancy or we will compound it.

Geoff

From: Anne Vandervord @.> Sent: Wednesday, 20 July 2022 9:58 PM To: FreeUKGen/FreeCENMigration @.> Cc: geoffj-FUG @.>; Mention @.> Subject: Re: [FreeUKGen/FreeCENMigration] Check integrity of Site Stats (Issue #1428)

@geoffj-FUG https://github.com/geoffj-FUG Sorry but I would just like to check that I am understanding your logic.

I have tried to make the calculation for Total Records (CSV + VLD) for 20th July based on my understanding of your formula using the numbers reported on Site Stats page.

A. Total VLD records @ 20th July = 42,241,109 B. VLD records added (New VLD records column) = 62,567 C. Total VLD records @ 20th June = 42,257,640 D. Total CSV records @ 20th July = 6,347,404

E. VLD records removed = A - (B+C)

= 42,241,109 - (42,257,640 + 62,567)

= -79,098

F. Total Records @ 20th July = A + D - E

= 42,241,109 + 6,347,404 - 79,098 (ignore the sign on E ?) = 48,509,415

NOTE: Search records @20th https://github.com/20th July reported on Site Stats as 44,967,143

Does that look correct?

— Reply to this email directly, view it on GitHub https://github.com/FreeUKGen/FreeCENMigration/issues/1428#issuecomment-1190184037 , or unsubscribe https://github.com/notifications/unsubscribe-auth/AKCPIFN7RU3QRV3IXSU7VRTVU7STJANCNFSM5XQI76TA . You are receiving this because you were mentioned. https://github.com/notifications/beacon/AKCPIFLE753NBQZY72UG5B3VU7STJA5CNFSM5XQI76TKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOI3YMIZI.gif Message ID: @. @.> >

AnneV-Learn commented 2 years ago

Reply from Geoff:

Anne

Total VLD records at 20 June = C = 42,257,640

Total records at 20 July = total VLD at 20 July + total CSV at 20 July = A + D = 42,241,109 + 6,347,404 = 48,588,513

VLD at start = C

VLD added = B

Potential VLD records at end = B + C = 42,320,207

Actual VLD records at end = A = 42,241,109

VLD records removed = (B + C) – A = 42,320,207 – 42,241,109 = 79,908 (I rearranged the formula to arrive at a positive number).

Your F calculation appears to me to have deducted E twice. E will already have disappeared from the total VLD count because the records have been removed so should not be deducted. (Yell at me if my logic is wrong).

The number of VLD records removed is quite feasible. I have removed about 51,350 records from the vld dataset and reloaded them as csv myself during that time period (an unusual number but I had a backlog to address that included a couple of big files).

We appear to have a problem – The difference between the official stat at 20 July and my calculation is 3,621,370. (I.e. 44,967,143 - 48,588,513 from above). Before we do anything else, we need to investigate that discrepancy or we will compound it.

Geoff

AnneV-Learn commented 2 years ago

@geoffj-FUG I have done some investigation into the underlying database records and I believe that the discrepancy arises because the CSV Record count includes ‘unincorporated’ records. For Example:

Year = 1911 Search records = 44,187 VLD records = 0 CSV total records = 313,955 CSV records not incorporated = 269,768 CSV records incorporated = 4,4187 VLD + CSV incorporated = 0 + 4,4187 = 4,4187

Year = 1861 Search records = 12,711,188 VLD records = 12,237,527 CSV total records = 990,927 CSV records not incorporated = 370,982 CSV records incorporated = 619,945 VLD + CSV incorporated = 12,237,527 + 619,945 = 12,857,172 — close enough as the database is a moving target???

Now looked at a specific Year and County with count of records & individuals:

Year =1861 and County=CAM Search records = 59,670 VLD records = 56,933 (individuals = 55,976) CSV total records = 3,694 (individuals = 3,671) CSV records not incorporated = 0 CSV records incorporated = 3,694 (individuals = 3,671) VLD + CSV incorporated = 56,933 + 3,694 = 60,627 (individuals = 55,976 + 3,671 = 59,647) — closer when looking at individuals

geoffj-FUG commented 2 years ago

Anne

I am happy with that explanation for the difference in actual and reported. Thank you for the reconciliation.

As 1911 is mostly Somerset I had a look at my stats and your total records Incorporated makes sense.

We need to make sure that our count is based on Incorporated records. Actual records can be duplicated as pieces are copied from the transcriber to the proofreader to the validator. Unincorporated records are not searchable.

My formulae were as follows:

Total VLD records at 20 June = C = 42,257,640

Total records at 20 July = total VLD at 20 July + total CSV at 20 July = A + D = 42,241,109 + 6,347,404 = 48,588,513

VLD at start = C

VLD added = B

Potential VLD records at end = B + C = 42,320,207

Actual VLD records at end = A = 42,241,109

VLD records removed = (B + C) – A = 42,320,207 – 42,241,109 = 79,908 (I rearranged the formula to arrive at a positive number).

When I look at these formulae C is a correct figure as all VLD records are searchable (they are not uploaded until they are).

A and B are therefore also correct.

D needs re-defining. It should be total CSV records Incorporated, instead of total CSV records as Unincorporated records are not searchable.

Now I think we are both on the same track.

Geoff

AnneV-Learn commented 2 years ago

Geoff,

Ok now to try and simplify things - as I think we can agree the number of search records reported in Site Stats is correct. If we are just looking to arrive at the number of records added to the database in last 30 days can we not just take the number of Search Records reported in Site Stats for ‘today’ and number of Search records reported for ‘today’ - 30 days?

KR Anne

geoffj-FUG commented 2 years ago

Anne

That will achieve the result. It just depends on what detail Pat and the Trustees require.

Geoff

PatReynolds commented 2 years ago

@DeniseColbert is best placed to answer your question.

DeniseColbert commented 2 years ago

Search record calculation is fine, could in theory replace what exists currently, @DeniseColbert to check and let Anne know.

DeniseColbert commented 2 years ago

@AnneV-Learn to look at day-on-day difference calculation for individuals report

AnneV-Learn commented 2 years ago

@DeniseColbert ready for testing on TEST3. I have replaced the Individuals report with Search Records report (this represents the number of records currently in the database - see discussion with Geoff above). Be aware that the data on TEST3 is rather 'flaky' and that the Site Stats data on Test3 is not totally up to date. I tested for period 01/03/2022 - 31/03/2022 and it appeared to be ok.

DeniseColbert commented 2 years ago

Thanks @AnneV-Learn :)

Looks good to me (if there were only LND records added to Test3 in March?)

image

AnneV-Learn commented 2 years ago

Yes @DeniseColbert only LND records were added to Test3 in March.

DeniseColbert commented 2 years ago

Brilliant, this can be deployed then! Thanks Anne

DeniseColbert commented 2 years ago

Done, closing