Public-Health-Scotland / source-linkage-files

This repo is for the syntax used for the PHS Source Linkage File project
https://public-health-scotland.github.io/source-linkage-files/

Investigate and change format of certain variables for 2014-2016 in episode and individual files #860

Open SwiftySalmon opened 11 months ago

SwiftySalmon commented 11 months ago

There's a problem when running read_slf_individual, and probably also with the episode file, that was discovered when running the dementia IR. If you read in any year between 2014 and 2016 together with any later year, some of the variables have different types between years and you get an error message:
Error in compute.arrow_dplyr_query():! Invalid: UnionNode input schemas must all match, first schema was: year: string

The current workaround is to read the years separately, change the format, and bind them together.
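As a rough illustration of that workaround, here is a minimal base R sketch. The two toy data frames stand in for years read separately (for example with read_slf_individual); the column names, values, and the `to_character` helper are hypothetical, and converting to character is only one possible choice of common type.

```r
# Toy stand-ins for two years read separately (hypothetical columns and values).
older <- data.frame(year = 1516L, anon_chi = "A1", stringsAsFactors = FALSE)
newer <- data.frame(year = "1718", anon_chi = "B2", stringsAsFactors = FALSE)

# Find the shared columns whose class differs between the two years.
shared <- intersect(names(older), names(newer))
differs <- shared[vapply(shared, function(col) {
  !identical(class(older[[col]]), class(newer[[col]]))
}, logical(1))]

# Hypothetical helper: convert the named columns to character.
to_character <- function(df, cols) {
  if (length(cols) == 0) return(df)
  df[cols] <- lapply(df[cols], as.character)
  df
}

# Harmonise both years, then bind them together.
combined <- rbind(to_character(older, differs), to_character(newer, differs))
```

Base `rbind` would silently coerce these toy columns anyway; the explicit conversion matters for stricter tools, which refuse to combine columns of different types.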

We'll need to either rerun the 2014-16 files or open the files, change the variables, and save them. We don't have a comprehensive list of which variables are affected, so this will require some investigation.
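One way to start that investigation might be to tabulate each variable's type per year and flag the ones that disagree. The sketch below uses base R only; the `files` list, its column names, and its values are made-up stand-ins for files read one year at a time, not the real SLF layout.

```r
# Hypothetical stand-ins for files read one year at a time.
files <- list(
  "1516" = data.frame(year = 1516L, gender = 1L, stringsAsFactors = FALSE),
  "1617" = data.frame(year = 1617L, gender = 1L, stringsAsFactors = FALSE),
  "1718" = data.frame(year = "1718", gender = 1L, hscp = "S37",
                      stringsAsFactors = FALSE)
)

all_cols <- sort(unique(unlist(lapply(files, names))))

# One row per variable, one column per year; NA means the variable is absent.
type_table <- sapply(files, function(df) {
  vapply(all_cols, function(col) {
    if (col %in% names(df)) class(df[[col]])[1] else NA_character_
  }, character(1))
})

# Variables whose type differs between years, or that are missing in some year.
affected <- all_cols[apply(type_table, 1, function(types) {
  anyNA(types) || length(unique(types)) > 1
})]
```

Printing `type_table` gives a variable-by-year grid that could be pasted into the issue, and `affected` is the shortlist of variables to fix or document.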

Jennit07 commented 9 months ago

I've been investigating this issue and thought it could potentially have a 'quick' fix by bringing the variables in the older files up to date with 1718 onwards. However, there is some mismatch between the variables, e.g. some are missing in 1617 but available in 1718, or vice versa. I tried to change the types of some of these variables but have not managed to read the old files in together with the newer files.

I have had a look in sourcedev and we still have the extracts from the last time this was run, so my suggestion is to run the older years against the current R code (as an ad-hoc update), which should correct this issue. This will also bring any mismatching variables up to date with the current process, and it would mean the underlying data will not have any differences for analytical use. This should be done separately from the upcoming March update to prevent any overlap. A new issue to run the older files against the existing code has been created in #893.

I have also updated the episode file and individual file layout documents in #887, bringing them in line with any missing variables and any variables which we no longer have in the file.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open approximately 5 months with no activity.