Data completion analysis

dotloadmovie commented 1 year ago

OUT OF SCOPE FOR ORIGINAL BUDGET: Great idea, but there is no money to do this right now. If we get additional money, it will be used to reduce ongoing running costs and NOT to develop new functionality/features.

Report for Rashid and CA on how much data is held and how complete it is. Is this a one-off? No: they want a process in place ON the platform so that they can see the progress and completeness of each of their LAs, how far along each upload is, and for which data sets.

MagicMiranda commented 12 months ago

@dotloadmovie could you add some basic detail to the Description box above too please? Thanks

MichaelHanksSF commented 6 months ago

Once pan-agg has been created, a log is created for the hub with a summary of the dataset by file, by year, by LA, that tells the hub the status of each file/year/LA combo:

successfully processed
unsuccessfully processed
no file uploaded
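
A rough sketch of how that summary could be built, in case it helps the discussion. Everything here is an assumption rather than existing platform code: the `Status` enum, the `build_summary` function, and the shape of `outcomes` (a dict keyed by `(la, year, file)` where `True` means processed, `False` means failed, and a missing key means nothing was uploaded).

```python
from enum import Enum
from itertools import product


class Status(Enum):
    # The three states listed above.
    PROCESSED = "successfully processed"
    FAILED = "unsuccessfully processed"
    MISSING = "no file uploaded"


def build_summary(las, years, files, outcomes):
    """Return a status for every LA/year/file combination.

    `outcomes` is a hypothetical input: {(la, year, file): bool}.
    A key that is absent is treated as "no file uploaded".
    """
    summary = {}
    for combo in product(las, years, files):
        if combo not in outcomes:
            summary[combo] = Status.MISSING
        elif outcomes[combo]:
            summary[combo] = Status.PROCESSED
        else:
            summary[combo] = Status.FAILED
    return summary


# Illustrative data only: one LA, one year, three 903 files.
summary = build_summary(
    las=["LA1"],
    years=[2016],
    files=["header", "uasc", "episodes"],
    outcomes={("LA1", 2016, "header"): True, ("LA1", 2016, "uasc"): False},
)
for (la, year, file), status in summary.items():
    print(la, year, file, status.value)
```

Iterating over the full product of LAs, years and files (rather than only over what was uploaded) is what makes the "no file uploaded" rows appear in the log.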

MichaelHanksSF commented 1 month ago

Add requirement from additional draft card:

[ ] User that uploads is able to see what they still need to upload and when

MichaelHanksSF commented 1 month ago

Add requirement from additional draft card:

[ ] ETL code to work out how complete an upload is, e.g. 95% of files for 903 received. Include status on the folder?
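
A minimal sketch of the completeness calculation, assuming we have a list of the files expected for a return and a list of the files actually received. The function name `completeness` and the file names in the example are made up for illustration; they are not taken from the ETL code.

```python
def completeness(expected_files, received_files):
    """Return (percentage of expected files received, sorted list of missing files).

    Files that were received but not expected are ignored.
    """
    expected = set(expected_files)
    if not expected:
        # Nothing expected, so there is nothing to measure against.
        return 0.0, []
    received = expected & set(received_files)
    missing = sorted(expected - received)
    return 100 * len(received) / len(expected), missing


# Illustrative example: 3 of 4 expected files received.
pct, missing = completeness(
    ["header", "episodes", "uasc", "reviews"],
    ["header", "uasc", "episodes"],
)
print(f"{pct:.0f}% of files received, still missing: {missing}")
```

The list of missing files could also cover the earlier requirement that the uploading user can see what they still need to upload.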

patrick-troy commented 3 weeks ago

Potential fix to show the missing headers in the log file and save LA analysts time in rectifying this issue:

In the pipeline, if we are unable to identify headers, we could output a list of just the missing headers to the log file. We could store the matches for each table, identify which has the most matches, and assume that is the table they're trying to upload, e.g.

attempted 2016 ssda903 UASC upload
expected headers: CHILD, DOB, SEX, DUC
actual headers: CHILD, DOB, GENDER, DUC
3 matches/ 75% match rate -> expected table UASC
based on expected table, show SEX as missing header
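
The idea above could look something like this. It is only a sketch: `guess_table` and the `schemas` dict are hypothetical names, and the header lists are the ones quoted in this comment rather than the pipeline's real config.

```python
def guess_table(actual_headers, schemas):
    """Pick the table whose expected headers overlap most with the upload.

    Returns (table name, expected headers missing from the upload).
    If two tables have the same number of matches, the first one in
    `schemas` wins, which is a known weakness of counting alone.
    """
    actual = set(actual_headers)
    best = max(schemas, key=lambda name: len(actual & set(schemas[name])))
    missing = [h for h in schemas[best] if h not in actual]
    return best, missing


# Header lists as quoted in this thread (assumed, not read from config).
schemas = {
    "UASC": ["CHILD", "DOB", "SEX", "DUC"],
    "Header": ["CHILD", "DOB", "SEX", "ETHNIC", "UPN", "MOTHER", "MC_DOB"],
}

table, missing = guess_table(["CHILD", "DOB", "GENDER", "DUC"], schemas)
print(f"expected table {table}, missing headers: {missing}")
```

Here UASC has 3 matches and Header has 2, so UASC is chosen and SEX is reported as missing.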

One issue I can foresee with this is that other tables have similar headers, and therefore similar match counts, e.g.

attempted 2016 ssda903 UASC upload
expected headers: CHILD, DOB, SEX, DUC
actual headers: CHILD, DOB, SEX, DUX (this time SEX matches but DUC doesn't)
Header file expected headers: CHILD, DOB, SEX, ETHNIC, UPN, MOTHER, MC_DOB

As you can see, there would be 3 matches for both the UASC and Header files. Picking the highest match % might help resolve this, but it won't always work (e.g. if the Header file contained the same number of columns as UASC).
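
To make the tie problem concrete, here is a sketch that ranks tables by match rate and flags when the top two are tied, so the log could say "could not determine table" instead of guessing. `rank_tables` is a hypothetical helper, the header lists are the ones quoted above, and it assumes every table has at least one expected header.

```python
def rank_tables(actual_headers, schemas):
    """Rank candidate tables by match rate (matches / expected headers).

    Returns (scores, ambiguous) where scores is a list of
    (table name, match count, match rate) sorted best first, and
    ambiguous is True when the top two rates are equal.
    """
    actual = set(actual_headers)
    scores = []
    for name, expected in schemas.items():
        matches = len(actual & set(expected))
        scores.append((name, matches, matches / len(expected)))
    scores.sort(key=lambda s: s[2], reverse=True)
    ambiguous = len(scores) > 1 and scores[0][2] == scores[1][2]
    return scores, ambiguous


# Header lists as quoted in this thread (assumed, not read from config).
schemas = {
    "UASC": ["CHILD", "DOB", "SEX", "DUC"],
    "Header": ["CHILD", "DOB", "SEX", "ETHNIC", "UPN", "MOTHER", "MC_DOB"],
}

scores, ambiguous = rank_tables(["CHILD", "DOB", "SEX", "DUX"], schemas)
for name, matches, rate in scores:
    print(f"{name}: {matches} matches, {rate:.0%} match rate")
print("ambiguous:", ambiguous)
```

With these lists both tables have 3 matches, but UASC scores 3 of 4 and Header 3 of 7, so the rate separates them. If two tables had the same number of expected columns and the same matches, `ambiguous` would be True and we would need another signal (for example the file name) to decide.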

SocialFinanceDigitalLabs / sf-fons-platform

Data completion analysis #39