Tazinho closed this issue 6 years ago
The following regex patterns should do it. So the extra argument could be implemented, but it might not be needed:
to_any_case(".bla.bla-bla:bla.")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|\\.]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|-]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|:]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|\\.\\-]")
to_any_case(".bla.bla-bla:bla.", sep_in = "[^[:alnum:]|\\.\\-\\:]")
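One detail of the patterns above may be worth checking: inside a bracket expression, | is a literal character, not an alternation operator, so these patterns also stop treating | as a separator. The following is a minimal base-R sketch of that behaviour using gsub(); it only illustrates the regular expression and is not how snakecase applies sep_in internally. The input string is made up for illustration.

```r
# Hypothetical input containing a hyphen, a space and a pipe.
x <- "2-4 units|total"

# Pattern from the thread: the `|` sits inside the brackets, so it is a
# literal character that is excluded from the separators together with `-`.
with_pipe <- gsub("[^[:alnum:]|-]", "_", x)

# Same idea without the `|`: only alphanumerics and `-` are kept.
without_pipe <- gsub("[^[:alnum:]-]", "_", x)

print(with_pipe)
print(without_pipe)
```

If | should still count as a separator, dropping it from the brackets (as in the second pattern) is probably what is wanted.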
In particular, to get rid of apostrophes, one might need to include them first and then remove them during the transliteration step.
to_any_case("I don't like fish", sep_in = "[^[:alnum:]|']",
            transliterations = c("'" = ""))
"exception"/"ignore" parameter?
I don't know if this belongs here - but I didn't want to create a new issue, especially as there is a relatively easy workaround.
I am a janitor user and regularly use clean_names(). I only just today discovered its customisability through the replace argument!
If I do this:
df %>%
  clean_names(replace = c("<" = "-", ">" = "gt", "=" = "eq"))
then the first replacement (-) still gets superseded by the call to snake case, as it should, since - is not part of snake case. As an easy workaround, I can change this in the next line by using, for example, dplyr::rename_with(). But I'm wondering if there could be a way of passing exceptions to the function, saying "don't convert these characters".
In my example above, I would write:
df %>%
  clean_names(replace = c("<" = "-", ">" = "gt", "=" = "eq"), ignore = "-")
and this would allow the - through unscathed.
I think that I ought to be able to flag the - as an allowed character via sep_in (?):
df %>%
  clean_names(replace = c("<" = "-"), sep_in = "[^[:alnum:]|-]")
but that still ends up with the - converted to _. Reprex:
library(tidyr)
construction %>%
  janitor::clean_names(replace = c("to" = "-"), sep_in = "[^[:alnum:]|-]") %>%
  names()
#> [1] "year" "month" "x1_unit" "x2_4_units"
#> [5] "x5_units_or_more" "northeast" "midwest" "south"
#> [9] "west"
# (checking the regex...)
construction %>%
  names() %>%
  `[`(4) %>%
  stringr::str_replace_all(" to ", "-") %>%
  stringr::str_replace_all("[^[:alnum:]|-]", "_")
#> [1] "2-4_units"
Created on 2021-02-18 by the reprex package (v1.0.0)
Hi @francisbarton, the issue lies in janitor::make_clean_names().
In snakecase, sep_in works as expected.
library(snakecase)
to_any_case(string = "2-4 units",
            sep_in = "[^[:alnum:]|-]")
# [1] "2-4_units"
So, one alternative for your use case would be to switch to snakecase and just use the transliterations argument for your replacements:
to_any_case(string = "2 to 4 units", transliterations = c("to" = "-"))
# [1] "2-4_units"
To see where the error gets introduced in janitor, you could run:
debugonce(make_clean_names)
make_clean_names(string = "2 to 4 units",
                 replace = c("to" = "-"),
                 sep_in = "[^[:alnum:]|-]",
                 use_make_names = FALSE)
This should show us that the following line introduces an unintended conversion:
cleaned_within <- stringr::str_replace(
  string = good_start,
  pattern = "[\\h\\s\\p{Punctuation}\\p{Symbol}\\p{Separator}\\p{Other}]+",
  replacement = "."
)
As good_start is "2 - 4 units" at this line, we get an intermediate result of
cleaned_within <- stringr::str_replace(
  string = "2 - 4 units",
  pattern = "[\\h\\s\\p{Punctuation}\\p{Symbol}\\p{Separator}\\p{Other}]+",
  replacement = "."
)
cleaned_within
# [1] "2.4 units"
This happens before snakecase gets called, so there is no really stable workaround via the snakecase arguments within make_clean_names().
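To make the order of operations concrete, here is a rough base-R sketch. It assumes a simplified POSIX character class as a stand-in for janitor's Unicode-aware pattern, and it emulates the sep_in step with a plain gsub(); neither line is janitor's or snakecase's actual code.

```r
# Value of good_start after the "to" -> "-" replacement (from the thread).
good_start <- "2 - 4 units"

# Assumption: [[:space:][:punct:]]+ stands in for the \p{...} classes.
# sub() replaces only the first match, like stringr::str_replace().
cleaned_within <- sub("[[:space:][:punct:]]+", ".", good_start)

# Emulated sep_in step: by now there is no hyphen left to preserve.
after_sep_in <- gsub("[^[:alnum:]|-]", "_", cleaned_within)

print(cleaned_within)
print(after_sep_in)
```

Under these assumptions the hyphen has already been turned into a dot before sep_in is considered, which matches the x2_4_units name seen in the reprex.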
Thank you @Tazinho - very thorough investigation. I'm embarrassed now about putting it here at all, I should have posted in janitor's issues. I just made an assumption that it would be an issue with snakecase.
The default sep_in argument is very restrictive about non-alphanumerics: by default, every non-alphanumeric character is treated as an input separator. Currently, setting e.g. sep_in = "-" allows just that one separator and overrides the separator property of all other non-alphanumerics at the same time. Instead, it might make sense to remove e.g. "-" from the set of separators while still treating the others as separators (or to find a nifty regular expression as a workaround to handle this inside sep_in).