OuhscBbmc / OuhscMunge

Data manipulation operations
http://ouhscbbmc.github.io/OuhscMunge/

Proposed update to OuhscMunge::snake_case() #127

Closed rwilson8 closed 1 year ago

rwilson8 commented 1 year ago

Is your feature request related to a problem? Please describe.

I regularly encounter variable names that OuhscMunge::snake_case(), and by extension OuhscMunge::column_rename_headstart(), can't handle, so I'm proposing new regex. Here are the original function and a messy column name:

snake_case <- function (x) 
{
  s <- gsub("\\.", "_", x)
  s <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", s)
  s <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", s))
  s <- gsub(" ", "_", s)
  s <- gsub("__", "_", s)
  s
}
sample_string <- "What's your race?
(Select all that apply:)"

Describe the solution you'd like

Currently the function converts only periods and spaces to underscores. I want it to convert all punctuation and whitespace to underscores, except apostrophes, which are removed so that contractions don't get split up. Also, it currently collapses exactly two consecutive underscores into one, and I want it to collapse any number of consecutive underscores into one and then remove leading and trailing underscores. Here is my proposed alternative:

snake_case <- function(x) {
  s <- gsub("'", "", x)
  s <- gsub("[[:punct:]]|[[:space:]]", "_", s)
  s <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", s)
  s <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", s))
  s <- gsub("_+", "_", s)
  s <- gsub("^_|_$", "", s)
  return(s)
}

Line 1 removes apostrophes. Line 2 converts all remaining punctuation or whitespace characters to underscores. Lines 3 and 4 are unchanged. Line 5 reduces any number of consecutive underscores to one. Line 6 removes leading and trailing underscores.
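As a rough check of the steps described above, here is a minimal sketch that runs the proposed body against the sample string and two other names. The function name `snake_case_proposed` and the two extra inputs are made up for illustration and are not part of the package; the sample string is written with `\n` in place of the literal line break. It assumes R's default regex engine (no `perl = TRUE`).

```r
# Hypothetical name; the body is the proposed function from this issue, unchanged.
snake_case_proposed <- function(x) {
  s <- gsub("'", "", x)                                   # drop apostrophes
  s <- gsub("[[:punct:]]|[[:space:]]", "_", s)            # punctuation/whitespace -> "_"
  s <- gsub("(.)([A-Z][a-z]+)", "\\1_\\2", s)             # split before capitalized words
  s <- tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", s))   # split lower/digit + upper, then lowercase
  s <- gsub("_+", "_", s)                                 # collapse runs of underscores
  s <- gsub("^_|_$", "", s)                               # trim leading/trailing underscore
  return(s)
}

# The sample string from above, plus two illustrative (assumed) column names.
messy <- c(
  "What's your race?\n(Select all that apply:)",
  "Date.Of.Birth",
  "patientID"
)

cleaned <- snake_case_proposed(messy)
print(cleaned)
# [1] "whats_your_race_select_all_that_apply" "date_of_birth"
# [3] "patient_id"
```

Because `gsub()` is vectorized, the same call should work on a whole vector of column names, e.g. `names(ds)` for a data frame `ds` (a placeholder name).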

Describe alternatives you've considered

snakecase::to_snake_case() and janitor::clean_names().

Additional context

N/A

wibeasley commented 1 year ago

Sounds good. Are you interested in submitting a PR?

When I wrote this function years ago, I was considering mostly valid column names that simply used a different naming convention. But you're right, we're using this function a lot on Excel spreadsheets where the author never intended the variable names to be used in a database or programming language.

rwilson8 commented 1 year ago

@wibeasley I tried, but it said it couldn't find the Master branch. (I assume because it's now called "main"). I guess I'll need you to do the update.

wibeasley commented 1 year ago

Yea, I'll take care of it. I posted this about the same time as your message above: https://github.com/OuhscBbmc/OuhscMunge/pull/128#issuecomment-1511937736

wibeasley commented 1 year ago

@rwilson8, I touched up some things in #130. I think they are all consistent with your goals. Tell me if not. Thanks again for thinking of these expanded use cases.

rwilson8 commented 1 year ago

@wibeasley I tested it out with some of my messier column names, and it worked great. Thank you!