Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland (https://public-health-scotland.github.io/phsmethods/)
https://public-health-scotland.github.io/phsmethods/
54 stars 13 forks source link

Matching geography codes with the area names #18

Closed jvillacampa closed 4 years ago

jvillacampa commented 4 years ago

As in issue #13. Function that matches a column containing geography codes with their respective area names. I don't think I have done a very good job with the tests and covering potential problems with user's inputs, so any feedback very welcomed.

fixes #13.

jackhannah95 commented 4 years ago

Hi @jvillacampa. Just to say that @davidc92 and @lucindalawrie are going to do a review of this and get back to you soon.

jvillacampa commented 4 years ago

Cheers David, I will take a look to your comments in the next few days

jvillacampa commented 4 years ago

Hi @davidc92, sorry I pushed some changes, but I still need to deal with some of your comments from before! I'll let you know when I have got everything ready. Cheers.

jvillacampa commented 4 years ago

Hi @davidc92, thank you for all the comments. I think I have dealt with all of them now and is ready for a new round of review. A couple of points:

davidc92 commented 4 years ago

Hi @davidc92, thank you for all the comments. I think I have dealt with all of them now and is ready for a new round of review. A couple of points:

  • The code extracting the info from the API is in the function at the moment, but it won't run as the reference file exists. Not sure how to ensure this file is kept up to date in a low-maintenance way.
  • The devtools::check fails as during the tests (test_that) it fails as it can't open the reference files I use. Slightly confused about why is this. Could this be solved using the package here?

Adding here() might actually solve some of the tests failing - give that a go and see if it works. Also, you'll have to update one of the unit tests for file_size(). Line 8 is testing that the number of files in the tests directory is equal to 8, but you have also added a test file so the 8 should become 9 :)

jvillacampa commented 4 years ago

Using "here" doesn't help in this case. Apparently this is a well known issue. I have tried the proposed workaround but still fails. I have done a workaround replicating one of the files and taking out one of the tests that I don't think added much.

However, after doing these changes still fails. The problem is that when the function runs it doesn't recognize that the lookup in the reference files folder exists and then it tries to connect to the open data platform, which could be fine, but for some reason it doesn't find the function getURL from RCurl (which is a dependency of SPARQL). I have tried a couple of things, but a bit stuck here.

The tests run perfectly fine manually, fail as described above for devtools::checks(), and for devtools::test() they connect to the open data platform but fail when it comes to the line where it saves the data in the reference file folder.

I have pushed the latest version of what I was trying, but any help/advice welcome.

jvillacampa commented 4 years ago

Ok @davidc92, I figured out a workaround for this issue. I had to save another copy of the data in the testthat folder. devtools::check don't bring any errors now.

I have also adjusted file types and folder names to follow the recommended ones here. The code that creates the data is in data-raw.

Sorry it took me so long to look at this again!

jvillacampa commented 4 years ago
1. Can R version 3.5.0 be added to the description file as a dependency? Our `.rds` files can't be loaded by older versions of R and this is throwing out a warning. Adding this will clear it.

2. `devtools::check()` is flagging non-standard files and folders at the top level - can you add the data-raw folder to the `.Rbuildignore` file?

3. `area_names_lookup` is being flagged as an undefined global variable. You can fix that by adding `.data$` at the start of all mentions of it in the code in the function e.g. `area_name_lookup$geo_code` should become `.data$area_name_lookup$geo_code`. `.data$` is a function from rlang which is already listed as an import in the namespace.

4. I don't really like that we have a folder named "data". To me the contents just look like the lookup file, so can that be moved to a "reference_files" folder instead? Does the script in "data-raw" need to be in a separate folder also?

Thanks!

  1. Done
    1. Done
    2. Done, your solution didn't work for me, so I found another one using utils::globalVariables.
    3. I was following the structure and naming conventions suggested by H. Wickham here which is also the same as the one from the devtools cheetsheet. I am not very bothered about it, if you want to change it.

I ran devtools::check and came without errors or notes, let's see what Travis say!. Thanks a lot for all the reviews!

jackhannah95 commented 4 years ago
  1. I was following the structure and naming conventions suggested by H. Wickham here which is also the same as the one from the devtools cheetsheet. I am not very bothered about it, if you want to change it.

I'll do a review of this function too, since there's been so much back and forward. Haven't gotten around to it yet so apologies if I'm sticking my craw in prematurely, but storing data in a folder called "reference_files" is a very ISD-specific practice that I haven't seen replicated in any other R packages. I would much prefer we follow devtools' conventions than ISD's ones.

jackhannah95 commented 4 years ago

Hi @jvillacampa,

Sorry, I have gone overboard with this. The more I thought about it, the more I thought that requiring a data frame to be provided to the function probably meant that it wasn't any less effort than just doing your own join to a lookup file. Some of the commits relate to small changes to the other functions at least.

I think the ease with a function like this would come with being able to just input directly anything from a single code to a column with a million of them and have it return the names in the same format. Something like:

df %>%
  mutate(area_name = match_area(geo_code))

I also thought it should be able to do the reverse, since both are available in the lookup. Please let me know if you're not happy with anything or have any improvements to suggest. Or if it's all terrible and you preferred the original way.

I took the same approach to saving, storing and calling data as the janeaustenr package. It seems to be working fine, and has the handy benefit that R >= 3.5 isn't required and no duplicates need to be created for unit tests. Again though, let me know if you think anything could be done better.

The SPARQL code extracts codes that have no accompanying area name, but I had to drop them as otherwise entering an NA name returned all those codes. In practice dropping them doesn't make a difference to the function as you intended it, because entering any code that isn't in the lookup dataset returns an NA anyway. It'd be more efficient to edit the SPARQL query to only extract codes with an accompanying area name, rather than bringing them all in and dropping all NA values. I tried editing the query and it was a disaster. Any help you're able to provide with that would be much appreciated.

Cheers,

Jack

jvillacampa commented 4 years ago

Hi @jackhannah95, thanks for doing that.

Linking name to code - I thought about it, but the main problem is that names are not unequivocal. I really don't think the way the function works would provide useful output. Or at least I don't think I would use to get codes from names as it would require me doing quite a bit of extra work for those cases with multiple codes, which are pretty common. I started taking all this functionality out the code, but I thought I better check it with you first.

I like what you have done so it takes a vector/variable instead of a data frame. Much neater.

I have a bunch of small edits that I will be doing, but I'll wait to hear from you before pushing stuff. Thanks, Jaime

jvillacampa commented 4 years ago

Just to say, that I think the feature of names to codes would be very useful, but it would need some more control, so users can decide the area type of their input (e.g. HB and they get S08 codes), and we will need a more complete lookup so more combinations of names are included (e.g. NHS Highland and not only Highland or Dumfries and Galloway and Dumfries & Galloway)

jackhannah95 commented 4 years ago

Hi @jvillacampa,

You know better than I do what'd be useful functionality in practice so yeah by all means make any changes you want to. I'm not precious about stuff being deleted.

jvillacampa commented 4 years ago

@jackhannah95 I uploaded some changes, all cosmetic apart from taking the "code" functionality. I think that would be very useful to have, but maybe for a future release I can figure out based on what you started. I tried to figure out how to drop the NAs from the SPARQL query but I failed, but your solution works well so not sure we need to look at it more. Thanks a lot for the work you have put into this, it's much better now!

jackhannah95 commented 4 years ago

Looks good @jvillacampa 👍 there are a few small things to do like delete some old tests and re-word the odd bit of documentation but nothing major.

I guess the only bits I would still quibble over are some of the names. code_var is more descriptive than x, granted, but I don't think many people will ever actually type out the variable name. We've called the first argument x in every other function (apart from file_size() but it takes a filepath rather than a value) and for the sake of consistency I'd prefer it to be the same in this one too, if that's okay.

Similarly, within the function itself, no_9char_codes is definitely more descriptive than n, but it's invisible to the user and I just wanted a concise way of getting that value into the warning message. I don't think it matters too much that it doesn't really describe what that value is (I think the comment is probably descriptive enough) so ideally I'd like to shorten that back to n too if that'd be alright.

Chances are I'll have to do something at the end to make travis build anyway so I'm happy to do those bits (it'll be one or two commits max) and give you another look before this is finally merged. Would that be alright?

jvillacampa commented 4 years ago

I hate undescriptive object names in functions, just because it makes much harder to fix it and figure out what the function is doing, but you can change them as your leaving gift from me ☺

I saw that Travis was having a tantrum again, but not sure why it was, thanks for looking into it.

Cheers and enjoy the weekend!

Jaime

From: Jack Hannah notifications@github.com Sent: 09 April 2020 17:42 To: Health-SocialCare-Scotland/phsmethods phsmethods@noreply.github.com Cc: VILLACAMPA, Jaime (PUBLIC HEALTH SCOTLAND) jaime.villacampa@nhs.net; Mention mention@noreply.github.com Subject: Re: [Health-SocialCare-Scotland/phsmethods] Matching geography codes with the area names (#18)

Looks good @jvillacampahttps://github.com/jvillacampa 👍 there are a few small things to do like delete some old tests and re-word the odd bit of documentation but nothing major.

I guess the only bits I would still quibble over are some of the names. code_var is more descriptive than x, granted, but I don't think many people will ever actually type out the variable name. We've called the first argument x in every other function (apart from file_size() but it takes a filepath rather than a value) and for the sake of consistency I'd prefer it to be the same in this one too, if that's okay.

Similarly, within the function itself, no_9char_codes is definitely more descriptive than n, but it's invisible to the user and I just wanted a concise way of getting that value into the warning message. I don't think it matters too much that it doesn't really describe what that value is (I think the comment is probably descriptive enough) so ideally I'd like to shorten that back to n too if that'd be alright.

Chances are I'll have to do something at the end to make travis build anyway so I'm happy to do those bits (it'll be one or two commits max) and give you another look before this is finally merged. Would that be alright?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/Health-SocialCare-Scotland/phsmethods/pull/18#issuecomment-611628636, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AHHPIVZWYBFHEWXQWBLYMLLRLX3EZANCNFSM4KLEP63Q.


This message may contain confidential information. If you are not the intended recipient please inform the sender that you have received the message in error before deleting it. Please do not disclose, copy or distribute information in this e-mail or take any action in relation to its contents. To do so is strictly prohibited and may be unlawful. Thank you for your co-operation.

NHSmail is the secure email and directory service available for all NHS staff in England and Scotland. NHSmail is approved for exchanging patient data and other sensitive information with NHSmail and other accredited email services.

For more information and to find out how you can switch, https://portal.nhs.net/help/joiningnhsmail

jackhannah95 commented 4 years ago

Haha thanks! Sorry for bugging you. If you ever become a maintainer in the future you're welcome to change all the argument names. But then I guess you'd have to explain to everyone why their code doesn't work anymore 🤷

Hopefully these last two commits take care of everything and we can finally merge this.

jvillacampa commented 4 years ago

I think now @davidc92 needs to approve before we can merge