Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland (https://public-health-scotland.github.io/phsmethods/)
https://public-health-scotland.github.io/phsmethods/
54 stars 13 forks source link

CHI number functions #30

Closed graemegowans closed 4 years ago

graemegowans commented 4 years ago

Hey guys - I have some simple functions that I wrote for working with CHI numbers. One takes a 9 character CHI and pads it to 10 with a leading zero and the other does some basic validity checks (checks for length, valid characters, valid checksum) and returns error messages. I wanted to share these with some others in the team (and I was interested in learning how to make a package) so I had wrapped these into a package already: https://github.com/graemegowans/chilir. But if these would be more broadly useful then happy to add here? Thanks!

jackhannah95 commented 4 years ago

Hi @graemegowans,

Thanks for the suggestions!

Apologies if I'm being ignorant, but does chilir::pad_chi() do the same thing as stringr::str_pad()?

stringr::str_pad(c(123456789, 0123456789), width = 10, pad = "0")
#> [1] "0123456789" "0123456789"

A function that does CHI validation is interesting, and I believe both @alan-y and @Moohan have written similar functions in the past. Could I suggest speaking with them first? I have no idea what the best way of doing it would be and I don't want to be biased towards anyone's particular method, so it would be great if you guys could try to reach some consensus prior to anything being contributed to phsmethods.

Moohan commented 4 years ago

I've written a CHI check function but only in SPSS, I looked at @graemegowans chi_check() function (https://github.com/graemegowans/chilir/blob/master/R/chi_check.R) and it seems to do everything mine did i.e. test the check digit and also try to validate the date. I think the date check could be developed by adding something like is.na(lubridate::dmy(date_part) which would be shorter and also catch dates with months with fewer than 31 days.

graemegowans commented 4 years ago

Thanks guys! @jackhannah95 I'm sure I looked at the str_pad() function when I was working on it but the problem was that it was padding anything shorter than 10 rather than just those that have lost the leading zero, which I thought might not be what people would want.

stringr::str_pad("123", 10, pad = "0") [1] "0000000123"

Although this would be invalid anyway, so maybe not a problem. Not sure what the most appropriate behaviour would be?

Thanks @Moohan, that's a good idea, I can change that.

jackhannah95 commented 4 years ago

Ah I see. In that case I'd be quite happy to have a more specific function in phsmethods then that pads only 9 digit characters, as long as the documentation makes that restriction clear. I guess the two functions would have to work in tandem most of the time, i.e.

df %>%
  mutate(chi = chi_pad(chi),
         chi = chi_check(chi))

I would have "chi" be the first word in both function names and group them together in the same script with the same documentation, much like the qtr() functions have been.

I think consideration would need to be given to how they fail as well. For chi_pad() it would seem pretty obvious to leave any non-9 digit numbers as they are, but chi_check() is trickier. I guess if I were using this on a large input vector I'd probably want it to flag the invalid ones in some way (maybe by converting them to NA?) and leave the valid ones alone.

In the meantime while you're doing any mulling over chi_check(), feel free to make a branch and add in chi_pad()!

alan-y commented 4 years ago

I think consideration would need to be given to how they fail as well. For chi_pad() it would seem pretty obvious to leave any non-9 digit numbers as they are, but chi_check() is trickier. I guess if I were using this on a large input vector I'd probably want it to flag the invalid ones in some way (maybe by converting them to NA?) and leave the valid ones alone.

I had a second function called is_valid_chi() to flag these ones.

I think the date check could be developed by adding something like is.na(lubridate::dmy(date_part) which would be shorter and also catch dates with months with fewer than 31 days.

This is how I think I done it.

graemegowans commented 4 years ago

Thanks for all the feedback guys. My idea for the chi_check() function was for it not to operate on the numbers themselves, but to just flag invalid ones (similar to what @alan-y said about is_valid_chi()) - so something like:

df <- tibble(chi_no = c("0101209991", "3201209991", "0113209991", "3213209991",
                        "123", "12345678901", "?123456789", "A123456789", "0101209992"))
df %>% mutate(checked = chi_check(chi_no))
#> # A tibble: 9 x 2
#>   chi_no      checked          
#>   <chr>       <chr>            
#> 1 0101209991  <NA>             
#> 2 3201209991  invalid day      
#> 3 0113209991  invalid month    
#> 4 3213209991  day/month invalid
#> 5 123         too short        
#> 6 12345678901 too long         
#> 7 ?123456789  invalid character
#> 8 A123456789  invalid character
#> 9 0101209992  invalid checksum
alan-y commented 4 years ago

@graemegowans, it looks good. Remember to implement a check for leap years in the dates (otherwise we'll upset the Feb 29 babies - I know one!)

jackhannah95 commented 4 years ago

Looks good @graemegowans 👍 open a PR when you're ready and we'll do a review for you.

jackhannah95 commented 4 years ago

Hi @graemegowans. How are you getting on? Hoping to do another release of the package within the next 2 weeks. It'd be great to get these functions in if possible, but it's no problem if they have to wait until the next one.

graemegowans commented 4 years ago

Sorry @jackhannah95, was on leave! Everything is ready to go for review I think so I will make a request hopefully today.

jackhannah95 commented 4 years ago

Added in #31.