dfe-analytical-services / dfeR

Common R tasks in the Department for Education (DfE)
https://dfe-analytical-services.github.io/dfeR/
GNU General Public License v3.0

Add geography lookup data and functions #81

Closed cjrace closed 2 weeks ago

cjrace commented 1 month ago

Before you start reading, go make a brew ☕ no PR with this many file changes will be a quick read...

Brief overview of changes

Added the Ward-PCon-LAD-LA hierarchy lookup file into the package (currently at https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/ward_lad_la_pcon_hierarchy.csv), joined on region and country columns, and added some helper functions to pull unique locations from that file at Ward, PCon, LAD, LA, and Region level.

Added a countries dataset too, to replace https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/country.csv.

Also moved some of the code around in a way that I'm hoping will future-proof the package a bit. I had some internal debates on this that I've detailed at the end of the PR, along with some specific questions I'd like reviewers to feed back on.

Why are these changes being made?

Want to make this easier for everyone in DfE: moving the lookup file into the package lets us start moving away from relying on the screener repo, and puts the lookups at the fingertips of R users.

It will be a really nice benefit for us too, for use in the screener and the edustats briefing tool (which is currently running off this branch).

Detailed description of changes

  1. ONS API query

Added a wrapper function for getting lookup data from the ONS API rather than downloading the files manually. Hoping this will prove useful for future lookups too. It has its quirks...

Exported it in case it's helpful for anyone, though strongly suspect it'll only be us who use it.
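
For anyone curious, here's a minimal sketch of the kind of request the wrapper makes - the Open Geography Portal serves lookups through the ArcGIS REST API. The service path below is purely illustrative, and the wrapper's actual name and arguments may differ:

library(httr)
library(jsonlite)

# Illustrative Open Geography Portal query; the service name here is an
# assumption, not necessarily the one used by the package
query_url <- paste0(
  "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/",
  "WD17_PCON17_LAD17_UK_LU/FeatureServer/0/query",
  "?where=1%3D1&outFields=*&f=json"
)

response <- fromJSON(content(GET(query_url), as = "text", encoding = "UTF-8"))

# ArcGIS returns rows as a 'features' array with an 'attributes' table
lookup <- response$features$attributes

(A known quirk of this API, possibly among those alluded to above: it caps how many records a single query returns, so bigger lookups need paging with resultOffset.)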

  2. Creating the data sets

This is documented in the data-raw/ folder, following the package guidance at https://r-pkgs.org/data.html.

Created functions in the internal_functions.R script to help neaten the code a bit, so hopefully it's easier to work with moving forwards.
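
As a rough illustration of that data-raw/ pattern (a hypothetical sketch - the column names are assumptions, not the data set's actual structure):

# data-raw/countries.R: build the countries data set shipped with the package
countries <- data.frame(
  country_code = c("E92000001", "N92000002", "S92000003", "W92000004"),
  country_name = c("England", "Northern Ireland", "Scotland", "Wales")
)

# Write into data/ so it's lazy-loaded with the package
usethis::use_data(countries, overwrite = TRUE)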

  3. Lookup data

I've recreated the hierarchy file from the data screener using the ONS API so we can have a more reproducible method when we next come to update it.

I did various bits of QA, checking the data added against the existing data in the data screener repo and dual running the API against what was downloadable from the Open Geography Portal. I'm happy that the data we have matches the screener data.

Have added some unit tests to make sure the expected numbers of locations are returned.
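
A sketch of the shape of those tests, assuming testthat (the bound below is taken from the "over 24,000" locations figure quoted later in this PR; the actual tests may assert more precise counts):

library(testthat)

test_that("wd_pcon_lad_la contains the expected number of locations", {
  # Lower bound only, taken from the figure quoted in this PR
  expect_gt(nrow(dfeR::wd_pcon_lad_la), 24000)
})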

Using waldo::compare(), the only difference I found was the following, which suggests the new lookup I've made here is a slight improvement on what we had before:


library(dplyr)

screener <- read.csv("https://raw.githubusercontent.com/dfe-analytical-services/dfe-published-data-qa/main/data/ward_lad_la_pcon_hierarchy.csv")

# Sort both tables identically so waldo compares rows like for like
waldo::compare(
  screener %>% arrange(ward_code, pcon_code, lad_code, new_la_code, ward_name, pcon_name, lad_name, la_name),
  dfeR::wd_pcon_lad_la %>% arrange(ward_code, pcon_code, lad_code, new_la_code, ward_name, pcon_name, lad_name, la_name)
)

One location in over 24,000 has a code that fails the usual pattern. I think it's safe to assume this was a typo in the ONS 2017 file, where they missed a 0 from the PCon code. I've fixed this in a commented section for manual fixes and noted it in the documentation for the data set so users are aware.

The change is to make Glasgow East's PCon code S14000030 instead of S1400030.
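
For illustration, the manual fix amounts to something like this (a sketch assuming dplyr and the pcon_code column used in the compare above, not the exact code in data-raw/):

library(dplyr)

# Manual fix: the 2017 ONS file dropped a 0 from Glasgow East's PCon code
wd_pcon_lad_la <- wd_pcon_lad_la %>%
  mutate(pcon_code = if_else(pcon_code == "S1400030", "S14000030", pcon_code))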


  4. Fetch functions

I'm hoping I've made these self-explanatory in the documentation - so rather than make this description any longer, I'll use this as the first test of that!

Making new 'fetch' functions from lookups should be pretty speedy moving forwards too.
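
For a flavour of usage (the function names here are assumed from the levels listed above - check the documentation for the real signatures):

library(dfeR)

# Hypothetical fetch calls, each returning the unique locations at that level
fetch_wards()
fetch_pcons()
fetch_regions()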

  5. Package code reshuffle

    • Moved code of conduct into .github folder
    • Moved comma_sep() and round_five_up() into helper_functions.R (as we already list them under helper functions on the pkgdown site)
    • Made a new family for formatting years, combining the format_ay and format_fy scripts
    • Made a new prettying family for pretty num, filesize and time taken functions
  6. create_project()

I haven't changed anything about this function, though I did a quick refactor of the validation code, as the way it was written before wasn't being styled properly by styler. You'd run styler and then lintr would complain; if you then addressed lintr's issues, styler would undo them the next time you ran it.

The refactor now plays more nicely with styler / lintr, and I think it has accidentally resolved #77 too. All the tests are still passing, so I don't think this needs much reviewing, though I'm tagging @JT-39 for sight as the original author of that function.

Additional information for reviewers

Future plans

We want to add more lookup files that are currently in the data screener, along with associated 'fetch' functions.

We should also make an extra table / options for the custom locations we've added, like 'Unknown' and 'Outside of England'.

Questions I'm interested in thoughts on

  1. Splitting of code / scripts

I've tried to combine our scripts a bit more and lay out a template / standard for how we approach this moving forwards. I'm wary that we can't have subdirectories in the R folder of a package, meaning we could end up with lots of scripts, and we could easily lose time to having code in inconsistent places and struggling to find it.

I've added some notes into the contributing file describing this, though I'm interested in thoughts on the approach. At a high level it is:

Then

For the creation of data sets, they get their own scripts in data-raw/; I expect these will use some of the helper functions (both exported and internal-only) to make them more readable and easier to update.

Question is, does that make sense as an approach?

  2. Testing

There are some tests on the data sets and the exported functions.

There's nothing on the ONS API wrapper, as there didn't seem to be anything obvious to test.

There are also currently no tests on internal, non-exported functions.

Question is, is there more we should be checking here, or are we fine with this?

  3. Regions

Currently only England has regions, which makes me wonder if:

a) we stick as we are now, and
a1) if so, whether we should allow Scotland / Wales / NI as regions? (We know some want it sometimes) or
b) we include an alternative level with Scotland / Wales / NI as regions, and join with those instead in the big lookup file for those countries? (I thought we might be able to do counties, but ONS only seem to publish them for England too)

  4. To fetch or not fetch

Currently regions and countries have fetch functions that just pull in their data. Maybe they don't need them? Maybe it's helpful? Maybe we need to expand to have a flag for custom DfE locations? Thoughts?

Issue ticket number/s and link

#77 is the only issue linked to this.

cjrace commented 1 month ago

Copying in @jen-machin for sight, though expecting @rmbielby will have the most opinions on reviewing as it's pulling across data from the data screener.

JT-39 commented 1 month ago

I think the only thing I can comment on really is the splitting of scripts.

I'm in favour of your updated method and condensing the scripts.

Perhaps make clear that a family of functions always has a family prefix, e.g. "pretty_"? (I know this is touched upon in the contributing guidance, so this is just me being picky.)

cjrace commented 3 weeks ago

Create geography lookup functions in dfeR

rmbielby commented 3 weeks ago

For the re-organisation, I'm wary it's just going to get messy the way you've got it. I agree with the lines:

But I'm not massively into how you're using internal_functions and helper_functions.

Thoughts are:

cjrace commented 3 weeks ago

Agree with you on the proposed reshuffling @rmbielby - I specifically asked about it as I wasn't fully happy with what I had; I'd already been through a couple of variations and found myself tripping up over the code and struggling to keep it easy to find as I went.

I like your solution the best of anything I've tried so far. Here's the updated contributing rules I've followed in my latest commit:

Folder and script structure conventions

In an R package you are not allowed to have any subdirectories in the R/ folder. Where possible we should have:

The script should share the name of the function or family. If needed, a _utils.R script should be used for internal-only functions that relate to a specific family.

Documentation for all data shipped with the package is kept in R/datasets_documentation.R. Scripts used for preparing data used in the package are not in the R/ folder; they live in data-raw/, and helper functions for this can be found in the R/datasets_utils.R script.

utils.R should be used to hold any cross-package helpers that aren't exported or specific to a family.

Every exported function or data set gets its own test script, called test-<function-name>.R or test-data-<data-name>.R if for a data set.
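
For example, under these conventions the prettying family and the lookup data might sit roughly like this (an illustrative sketch, not a listing of the actual repo):

R/pretty.R                                  # exported pretty_* functions
R/pretty_utils.R                            # internal helpers for that family
R/utils.R                                   # cross-package internal helpers
R/datasets_documentation.R                  # documentation for shipped data
R/datasets_utils.R                          # helpers used by data-raw/ scripts
data-raw/wd_pcon_lad_la.R                   # builds the lookup data set
tests/testthat/test-pretty_num.R            # one test script per export
tests/testthat/test-data-wd_pcon_lad_la.R   # one per data set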

rmbielby commented 3 weeks ago

This may be counter-intuitive, but given you've taken my suggestion and gone with it, I feel like I need to post-justify it with some more informed opinions than just what seemed like a good idea to me at the time... I've done a bit of scanning through blogs, and there's some interesting stuff in the following articles that's relevant for anyone reading back through this chain (and for us going forward with other packages). I think it broadly fits in with our thinking.