Tazinho / snakecase

🐍🐍🐍 A systematic approach to parse strings and automate the conversion to snake_case, UpperCamelCase or any other case.
GNU General Public License v3.0

Think about ignore argument #151

Closed Tazinho closed 6 years ago

Tazinho commented 6 years ago

The default sep_in argument is very restrictive about non-alphanumerics: by default, every non-alphanumeric character is treated as an input separator. At present, setting e.g. sep_in = "-" declares just that one separator and, at the same time, overrides the separator role of all other non-alphanumerics. Instead, it might make sense to be able to remove e.g. "-" from the set of separators while still treating the others as separators (or to find a nifty regular expression as a workaround that handles this inside sep_in).

Tazinho commented 6 years ago

The following regex patterns should do it. So the extra argument could be implemented, but might not be needed:

to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|\\.]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|-]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|:]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|\\.\\-]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|\\.\\-\\:]")
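To make it easier to see what each of these negated character classes actually matches, here is a minimal base R sketch. It does not call snakecase at all: separators_for() is a hypothetical helper, and it uses grepl(perl = TRUE), on the assumption that PCRE reads these particular patterns the same way the regex engine behind sep_in does.

```r
# Characters of the example string used above, one per element.
chars <- strsplit(".bla.bla-bla:bla.", "")[[1]]

# Hypothetical helper (not part of snakecase): return the distinct characters
# of `x` that `pattern` matches, i.e. those that would act as separators.
separators_for <- function(pattern, x = chars) {
  unique(x[grepl(pattern, x, perl = TRUE)])
}

separators_for("[^[:alnum:]]")            # all non-alphanumerics are matched
separators_for("[^[:alnum:]|\\.]")        # "." is no longer matched
separators_for("[^[:alnum:]|-]")          # "-" is no longer matched
separators_for("[^[:alnum:]|\\.\\-\\:]")  # none of ".", "-", ":" is matched

# Inside brackets "|" is a literal character, not alternation,
# so these patterns also exempt a literal "|".
grepl("[^[:alnum:]|-]", "|", perl = TRUE)
```

Under PCRE the "|" could be dropped from the patterns without changing which of ".", "-" and ":" are exempted; it only adds "|" itself to the exempted characters.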

Tazinho commented 5 years ago

Especially to get rid of apostrophes, one might need to exempt them from the separators first (so that they survive parsing) and then replace them during the transliteration step.

to_any_case("I don't like fish", sep_in = "[^[:alnum:]|']",
            transliterations = c("'" = ""))

francisbarton commented 3 years ago

"exception"/"ignore" parameter?

I don't know if this belongs here - but I didn't want to create a new issue, especially as there is a relatively easy workaround.

I am a janitor user and regularly use clean_names(). I only just today discovered its customisability through the replace argument! If I do this:

df %>% 
  clean_names(replace = c("<" = "-", ">" = "gt", "=" = "eq"))

then the first replacement (-) still gets superseded by the call to snake case, as it should, since - is not part of snake case. As an easy workaround, I can change this in the next line by using, for example, dplyr::rename_with(). But I'm wondering if there could be a way of passing exceptions to the function, saying "don't convert these characters".

In my example above, I would write:

df %>% 
  clean_names(replace = c("<" = "-", ">" = "gt", "=" = "eq"), ignore = "-")

and this would allow the - through unscathed.

I thought I ought to be able to flag the - as an allowed character via sep_in:

df %>%
  clean_names(replace = c("<" = "-"), sep_in = "[^[:alnum:]|-]")

but that still ends up with the - converted to _. Reprex:


library(tidyr)  # assumed source of the construction dataset and %>%

construction %>% 
  janitor::clean_names(replace =  c("to" = "-"), sep_in = "[^[:alnum:]|-]") %>% 
  names()
#> [1] "year"             "month"            "x1_unit"          "x2_4_units"      
#> [5] "x5_units_or_more" "northeast"        "midwest"          "south"           
#> [9] "west"

# (checking the regex...)
construction %>% 
  names() %>% 
  `[`(4) %>% 
  stringr::str_replace_all(" to ", "-") %>% 
  stringr::str_replace_all("[^[:alnum:]|-]", "_")
#> [1] "2-4_units"

Created on 2021-02-18 by the reprex package (v1.0.0)

Tazinho commented 3 years ago

Hi @francisbarton, the issue lies in janitor::make_clean_names.

In snakecase the sep_in works as expected.

to_any_case(string = "2-4 units",
            sep_in = "[^[:alnum:]|-]")
# [1] "2-4_units"

So, one alternative for your use case would be to switch to snakecase and just use the transliterations argument for your replacements:

to_any_case(string = "2 to 4 units", transliterations = c("to" = "-"))
# [1] "2-4_units"

To see where the error gets introduced in janitor, you could run:

make_clean_names(string = "2 to 4 units",
                 replace = c("to" = "-"),
                 sep_in = "[^[:alnum:]|-]",
                 use_make_names = FALSE)

This should show us that the following line introduces an unintended conversion:

cleaned_within <- stringr::str_replace(
  string = good_start,
  pattern = "[\\h\\s\\p{Punctuation}\\p{Symbol}\\p{Separator}\\p{Other}]+",
  replacement = "."
)

As good_start is "2 - 4 units" at this line, we get an intermediate result of

cleaned_within <- stringr::str_replace(
  string = "2 - 4 units",
  pattern = "[\\h\\s\\p{Punctuation}\\p{Symbol}\\p{Separator}\\p{Other}]+",
  replacement = "."
)
# [1] "2.4 units"

This happens before snakecase gets called, so there is no really stable workaround via the snakecase arguments within make_clean_names().
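The same effect can be reproduced without janitor or stringr. The following is only a rough base R sketch: sub() stands in for stringr::str_replace() (both replace only the first match), and the PCRE short property names \p{P}, \p{S}, \p{Z} and \p{C} are assumed to be equivalent to the long names used above.

```r
good_start <- "2 - 4 units"

# Base R stand-in for the stringr::str_replace() call: sub() also replaces
# only the first match. \p{P}, \p{S}, \p{Z}, \p{C} are the short forms of
# Punctuation, Symbol, Separator and Other.
cleaned_within <- sub(
  pattern = "[\\h\\s\\p{P}\\p{S}\\p{Z}\\p{C}]+",
  replacement = ".",
  x = good_start,
  perl = TRUE
)
cleaned_within

# The hyphen is already gone at this point, so a sep_in pattern that
# exempts "-" has nothing left to preserve.
grepl("-", cleaned_within, fixed = TRUE)
```

Because the hyphen is consumed together with its surrounding spaces, whatever sep_in is passed on to snakecase afterwards never sees it.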

francisbarton commented 3 years ago

Thank you @Tazinho - very thorough investigation. I'm embarrassed now about putting it here at all, I should have posted in janitor's issues. I just made an assumption that it would be an issue with snakecase.