Public-Health-Scotland / phsmethods

An R package to standardise methods used in Public Health Scotland (https://public-health-scotland.github.io/phsmethods/)
https://public-health-scotland.github.io/phsmethods/
54 stars 13 forks source link

Function to create age groups #27

Closed chrisdeans closed 4 years ago

jackhannah95 commented 4 years ago

Hi @chrisdeans,

Good work 👍 this will be a good addition to the package.

I haven't looked at the tests or documentation much yet, because I think the function(s) should be restructured slightly, and that will presumably have a knock-on effect.

Overall I think only one of the functions needs to exist, and I think it should be get_agegroup_seq() (although I would just call it age_group(); the "get" seems a bit superfluous to me). I think that the breaks argument in get_agegroup() will be quite hard to understand for less-experienced R users (which is probably why everyone uses a giant case_when() instead of cut() as it is).

I think users experienced enough to understand breaks = c(seq(0, 90, by = 5), Inf) are probably also experienced enough to be able to bypass this function entirely and use

cut(ages, 
    breaks = c(seq(0, 90, by = 5), Inf),
    labels = c(paste(seq(0, 85, by = 5), 
                     seq(4, 84, by = 5),
                     sep = "-"), 
               paste(90, "+", sep = "")))

to get the same result. get_agegroup_seq() seems much more user-friendly and, ultimately, beginner-friendly to me, so I'd be inclined to go with that.

I think there are probably also too many arguments in get_agegroup(). If get_agegroup_seq() becomes the overarching age_group() function then I don't think it needs the include_zero, include_inf, sep or above.char arguments. Adding that many arguments increases the function's complexity and I don't think it needs to be quite as flexible. Anyone wishing to exclude certain ages from this will presumably have filtered them out already, and I really think that anyone not happy with a - separating numbers or a + signifying "and over" can lump it. They're not hard to change afterwards if anyone is so inclined.

I've hummed and hawed about the as_factor argument vs just making it a character by default, but I suppose the factor setting is useful in a group_by() with .drop = FALSE so I'd be happy for that to stay.

There's quite a lot of error handling in the get_agegroup() that could hopefully be simplified a little by combining into one function and removing the include_zero and include_inf arguments. I'll leave looking too closely at those for now.

I hope these comments seem reasonable. I'm finishing at ISD on April 17th and ideally I'd like to do another release of the package with this and another couple of functions included before I go. Hopefully that's doable but not the end of the world if you don't get around to it.

Cheers,

Jack

chrisdeans commented 4 years ago

Hi @jackhannah95 ,

I've updated the function (now age_group()) along the lines you suggested. Could you have another look over it at some point?

It turns out after you take out those parameters all of the error handling also disappears; cut(), seq() and as.numeric() can handle any remaining error cases.

I've left in as_factor for the moment , but I've changed the default to FALSE i.e. character.

Thanks, Chris

chrisdeans commented 4 years ago

Hi @jackhannah95, thanks for the in-depth comments! I've added a new version, hopefully covering everything you mentioned. Let me know if you spot anything else!