Copying in @jen-machin for sight, though expecting @rmbielby will have the most opinions on reviewing as it's pulling across data from the data screener.
I think the only thing I can comment on really is the splitting of scripts.
I'm in favour of your updated method and condensing the scripts.
Perhaps make clear that a family of functions always has a family prefix, e.g. `pretty_`? (I know this is touched on in the contributing guidance, so just me being picky.)
For the re-organisation, I'm wary it's just going to get messy the way you've got it. I agree with the lines:
But I'm not massively into how you're using `internal_functions` and `helper_functions`.
Thoughts are:
Agree with you on the proposed reshuffling @rmbielby - I specifically asked about it as I wasn't fully happy with what I had. I'd already been through a couple of variations and found myself tripping up over the code while trying to keep everything easy to find as I went.
Like your solution the best of anything I've tried so far. Here are the updated contributing rules I've followed in my latest commit:
- In an R package you are not allowed to have any sub-directories in the `R/` folder. Where possible we should have one script per exported function or family of functions. The script should share the name of the function or family. If needed, a family-specific utils script can hold internal helpers for that family.
- Documentation for all data shipped with the package is kept in `R/datasets_documentation.R`.
- Scripts used for preparing data used in the package are not in the `R/` folder; they live in the `data-raw/` folder. Helper functions for this can be found in the `R/datasets_utils.R` script.
- `utils.R` should be used to hold any cross-package helpers that aren't exported and aren't specific to a family.
- Every exported function or data set gets its own test script, called `test-<function-name>.R`, or `test-data-<data-name>.R` if for a data set.
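For a concrete illustration of the naming convention, `usethis` generates files matching these rules (using a hypothetical `pretty_num()` function as the example):

```r
# usethis creates files that match the conventions above
usethis::use_r("pretty_num")    # creates R/pretty_num.R for the function or family
usethis::use_test("pretty_num") # creates tests/testthat/test-pretty_num.R
```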
This may be counter-intuitive, but given you've taken my suggestion and gone with it, I feel like I need to post-justify it with some more informed opinions than just what seemed like a good idea to me at the time... I've done a bit of scanning through blogs and there's some interesting stuff in the following articles that's relevant for anyone reading back through this chain (and for us going forward with other packages). Think it broadly fits in with our thinking.
Before you start reading, go make a brew ☕; no PR with this many file changes will be a quick read...
Brief overview of changes
Added the Ward-PCon-LAD-LA hierarchy lookup file into the package (currently at https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/ward_lad_la_pcon_hierarchy.csv), joined on region and country columns, and added some helper functions to pull unique locations from that file for Ward, PCon, LAD, LA, and Region level (see the usage sketch after this overview).
Added a countries dataset too, to replace https://github.com/dfe-analytical-services/dfe-published-data-qa/blob/main/data/country.csv.
Also moved some of the code around in a way that I'm hoping will future-proof the package a bit. I had some internal debates on this that I've detailed at the end of the PR, along with some specific questions I'd like reviewers to feed back on.
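To give a flavour of how the location helpers are used, here's a quick sketch (assuming the package loads as `dfeR`; the function name shown is illustrative, so check the package docs for the exact exports):

```r
library(dfeR)

# The kind of helper described above: pull the unique locations at one
# level of the Ward-PCon-LAD-LA hierarchy
pcons <- fetch_pcons()
head(pcons)
```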
Why are these changes being made?
We want to make this easier for everyone in DfE: moving the lookup file into the package lets us start moving away from relying on the screener repo, and puts the lookups at the fingertips of R users.
It will be a really nice benefit for us too, for use in the screener and also the edustats briefing tool (which is currently running off this branch).
Detailed description of changes
Added a wrapper function for getting lookup data from the ONS API rather than downloading the files manually. Hoping this will prove useful for future lookups too. It has its quirks...
Exported it in case it's helpful for anyone, though strongly suspect it'll only be us who use it.
This is documented in the `data-raw/` folder, following the package guidance at https://r-pkgs.org/data.html. Created functions in the `internal_functions.R` script to help neaten the code a bit, so hopefully it's easier to use moving forwards.
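For anyone curious, the Open Geography Portal serves lookups over an ArcGIS REST endpoint, so the wrapper boils down to something like the sketch below (not the actual function: the host shown is the ONS Geography ArcGIS organisation as I understand it, and `layer` is a placeholder for the feature service name). A likely example of the quirks mentioned is that ArcGIS feature services cap how many records one query returns, so a real wrapper also has to paginate.

```r
library(httr)
library(jsonlite)

# Sketch: query an Open Geography Portal layer (ArcGIS REST API) and
# return its attribute table as a data frame
get_ons_lookup <- function(layer) {
  url <- paste0(
    "https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/",
    layer, "/FeatureServer/0/query"
  )
  response <- httr::GET(url, query = list(
    where = "1=1",   # no filter: ask for every row
    outFields = "*", # all columns
    f = "json"
  ))
  parsed <- jsonlite::fromJSON(httr::content(response, "text", encoding = "UTF-8"))
  # Each feature's attributes become one row of the result
  parsed$features$attributes
}
```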
I've recreated the hierarchy file from the data screener using the ONS API so we can have a more reproducible method when we next come to update it.
I did various bits of QA, checking the added data against the existing data in the data screener repo and against files downloaded from the Open Geography Portal, to dual-run against the API. I'm happy that the data we have matches the screener data.
Have added some unit tests to make sure the expected numbers of locations are returned.
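The tests are along these lines (a sketch; the function name and expected count are placeholders for the values pinned in the real test files):

```r
library(testthat)

test_that("fetch_regions() returns the expected number of locations", {
  regions <- fetch_regions()
  # Pinning the count means a silent change in the lookup fails the tests
  expect_equal(nrow(regions), 9) # placeholder count
})
```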
Using `waldo::compare()`, the difference I found was the following (which suggests the new lookup I've made here is a slight improvement on what we had before): one location in over 24,000 has a code that fails the usual pattern. I think it's safe to assume this was a typo in the ONS 2017 file where they missed a 0 from the PCon code. I've fixed this in a commented section for manual fixes and noted it in the documentation for the data set so users are aware. The change is to make Glasgow East's PCon code `S14000030` instead of `S1400030`.
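The fix itself is the sort of thing sketched below (illustrative column name; the real code sits in the commented manual-fixes section of the `data-raw/` script):

```r
# Manual fix: the ONS 2017 file missed a 0 from Glasgow East's PCon code
lookup$pcon_code[lookup$pcon_code == "S1400030"] <- "S14000030"

# Sanity check: every code should now match the usual GSS pattern
# of one letter followed by eight digits
stopifnot(all(grepl("^[A-Z][0-9]{8}$", lookup$pcon_code)))
```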
Making new 'fetch' functions from lookups should be pretty speedy moving forwards too.
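That's because a new fetch function is mostly just a de-duplicated slice of the hierarchy lookup, along the lines of the sketch below (assuming the lookup ships as `wd_pcon_lad_la` with these column names; both are illustrative):

```r
# Sketch of a future fetch function: a unique slice of the lookup
fetch_lads <- function() {
  unique(wd_pcon_lad_la[, c("lad_code", "lad_name")])
}
```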
Package code reshuffle
Moved `comma_sep()` and `round_five_up()` into `helper_functions.R` (as we already list these under helper functions in the pkgdown site).

`create_project()`: I haven't changed anything about this function, though I did a quick refactor of the validation code, as the way it was written before wasn't being styled properly by `styler`. You'd run styler and then lintr would complain; if you then addressed lintr's issues, styler would undo them the next time you reran it. The refactor now plays more nicely with styler / lintr, and I think this has accidentally resolved #77 too. All the tests are still passing, so I don't think this needs much reviewing, though tagging @JT-39 for sight on this as the original author of that function.
Additional information for reviewers
Future plans
We want to add more lookup files that are currently in the data screener, along with associated 'fetch' functions.
We should also make an extra table / options for the custom locations we've added, like 'Unknown' and 'Outside of England' etc.
Questions I'm interested in thoughts on
I've added some notes into the contributing file describing this, though I'm interested in thoughts on the approach. At a high level it is the layout those notes set out. For the creation of data sets, they get their own scripts in `data-raw/` (see the sketch below); I expect these will use some of the helper functions (both exported and internal-only) to make them more readable and easier to update. Question is, does that make sense as an approach?
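For context, "their own scripts in `data-raw/`" means the standard `usethis` data-raw workflow, roughly (a sketch with a hypothetical data set name):

```r
# One-off: creates data-raw/my_lookup.R and ignores data-raw/ in the build
usethis::use_data_raw("my_lookup")

# Inside data-raw/my_lookup.R: build the data set, then write it to data/
my_lookup <- get_ons_lookup("my_lookup_layer") # hypothetical helper from the sketch above
usethis::use_data(my_lookup, overwrite = TRUE)
```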
There's nothing on the ONS API wrapper, as there didn't seem to be anything obvious to test. There are also currently no tests on internal, non-exported functions. Question is: is there more we should be checking here, or are we fine with this?
**Regions** Currently only England has regions, which makes me wonder if: a) we stick as we are now, a1) and therefore should we allow Scotland / Wales / NI as regions? (We know some want it sometimes.) Or b) we should include an alternative level treating Scotland / Wales / NI as regions, and join with those instead for those countries in the big lookup file? (I thought we might be able to do counties, but ONS only seem to publish them for England too.)
**To fetch or not fetch** Currently regions and countries have fetch functions that just pull in their data. Maybe they don't need them? Maybe it's helpful? Maybe we need to expand to have a flag for custom DfE locations (see the sketch below)? Thoughts?
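If we went the flag route, I'm picturing something like this (purely hypothetical signature and names):

```r
# Hypothetical: let users opt in to DfE custom locations
fetch_countries <- function(include_custom = FALSE) {
  if (include_custom) {
    # custom_dfe_locations would hold rows like 'Unknown', 'Outside of England'
    rbind(countries, custom_dfe_locations)
  } else {
    countries
  }
}
```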
Issue ticket number/s and link
#77 is the only issue linked to this.