Numerals in abbreviations

ilarischeinin commented 5 years ago

I have a case that I cannot figure out how to get right. I don't think it's an exotic one, so I'd imagine it must be already possible, but just can't get my head around how to specify it with the available parameters.

I want to convert (from snake case) "t2d_status" to lower camel case "t2dStatus". The problem is that no matter what I try, I get "t2DStatus", i.e. with a capital "D" whereas I need a lowercase "d".

library(snakecase)
to_lower_camel_case("t2d_status")
#> [1] "t2DStatus"

I tried to specify "t2d" as an abbreviation so it wouldn't get broken down:

to_lower_camel_case("t2d_status", abbreviations = "t2d")
#> [1] "t2DStatus"

Also tried to specify to keep numerals as is:

to_lower_camel_case("t2d_status", numerals = "asis")
#> [1] "t2DStatus"

And to change sep_in to just "_":

to_lower_camel_case("t2d_status", sep_in = "_")
#> [1] "t2DStatus"

Created on 2018-11-16 by the reprex package (v0.2.1)

None of these seem to help (nor did my attempts with parsing_option or transliteration), so could you please point me in the right direction here? Thank you.

I did try to go through the issue tracker to see if a case like this had popped up before, but that was kind of difficult as so many issues are not very descriptive, but more of record keeping on things to be implemented.

Tazinho commented 5 years ago

Thanks for reporting this. In theory you are right with the first approach. The implementation of abbreviations is just too naive atm.

Currently matches of abbreviations will be surrounded internally by underscores to ensure they are recognized as substrings. However, the substrings (abbreviations) are then parsed further and in your case t2d will be parsed into 3 substrings (because of the number).
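To illustrate the behaviour described above, here is a minimal base R sketch. It is not snakecase's actual code: split_parts() and capitalise() are hypothetical helpers, and the sketch assumes the parser breaks at underscores and at every letter/digit boundary.

```r
# Hypothetical helper (not part of snakecase): split a string into parts
# at underscores and at every letter/digit boundary.
split_parts <- function(x) {
  x <- gsub("([A-Za-z])([0-9])", "\\1_\\2", x)  # letter followed by digit
  x <- gsub("([0-9])([A-Za-z])", "\\1_\\2", x)  # digit followed by letter
  parts <- strsplit(x, "_+")[[1]]
  parts[nzchar(parts)]                          # drop empty pieces
}

# Hypothetical helper: upper-case the first character, lower-case the rest.
capitalise <- function(p) {
  paste0(toupper(substring(p, 1, 1)), tolower(substring(p, 2)))
}

# Surround the abbreviation with underscores, as described above.
marked <- gsub("(t2d)", "_\\1_", "t2d_status")

# The marked abbreviation is still parsed further and falls apart at the digit.
parts <- split_parts(marked)
print(parts)

# Lower camel case: first part stays lower case, later parts are capitalised.
result <- paste0(
  c(tolower(parts[1]), vapply(parts[-1], capitalise, character(1))),
  collapse = ""
)
print(result)
```

Because "d" ends up as a part of its own, it is capitalised like any other word, which is where the unwanted "D" comes from.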

I think a perfect solution would be to ignore the abbreviations during the parsing step. However, I am not sure how to implement this in an elegant way and will have to think a bit about that.

Tazinho commented 5 years ago

Possible implementation idea: string -> abbreviations ->

Match only specific sequences for abbreviations: 1. Sequences of upper case letters (and numerals) and then sequences of lower case letters (and numerals). (Possibly also test that abbreviations are not in mixed case.)
Matches will be replaced with
- the result of paste(ABR, sample(LETTERS, 3), digit), where numeral digits are replaced by the letters abcdefghi. The replacement will be surrounded by underscores.
- Before that, it will be checked that paste(ABR, sample(LETTERS, 3)) is not contained within the string. If it is, the sample length will be increased by one until it fits.

sep_in -> parsing_option -> split ->

Re-replacement of the placeholders with the original abbreviations

-> ...
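The placeholder idea above could look roughly like the following base R sketch. None of these helpers exist in snakecase. Upper-casing the placeholder and mapping "0" to "j" are assumptions made here so that the placeholder is a single run of upper case letters that a parser would leave in one piece.

```r
# Hypothetical helper: swap an abbreviation for a digit-free placeholder.
protect_abbreviation <- function(x, abbr) {
  # Map digits to letters so the placeholder contains no numerals.
  digit_free <- chartr("1234567890", "abcdefghij", abbr)
  n <- 3
  repeat {
    suffix <- paste(sample(LETTERS, n, replace = TRUE), collapse = "")
    placeholder <- paste0(toupper(digit_free), suffix)
    # The placeholder must not already occur in the input string.
    if (!grepl(placeholder, x, fixed = TRUE)) break
    n <- n + 1  # otherwise try again with a longer random suffix
  }
  list(
    string = gsub(abbr, paste0("_", placeholder, "_"), x, fixed = TRUE),
    placeholder = placeholder,
    abbreviation = abbr
  )
}

# Hypothetical helper: put the original abbreviation back after splitting.
restore_abbreviation <- function(parts, protected) {
  parts[toupper(parts) == protected$placeholder] <- protected$abbreviation
  parts
}

protected <- protect_abbreviation("t2d_status", "t2d")

# Stand-in for the sep_in / parsing_option / split steps.
parts <- strsplit(protected$string, "_+")[[1]]
parts <- parts[nzchar(parts)]

restored <- restore_abbreviation(parts, protected)
print(restored)

# Lower camel case from the restored parts.
camel <- paste0(
  restored[1],
  paste0(toupper(substring(restored[-1], 1, 1)), substring(restored[-1], 2),
         collapse = "")
)
print(camel)
```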

Edit: otherwise it might be possible to work around the numerals parsing

The third and possibly best approach would be to split first on the abbreviations, mark the abbreviations and then split a second time when parsing the non-abbreviation substrings. However, I will need to evaluate this approach in a new dev branch first.

Tazinho commented 5 years ago

Once I get to this, the process will probably have to look like this:

check if abbreviations were supplied
if true, check for matches
if matches occur, split the respective strings and record which of the substrings are abbreviations
parse the substrings (which aren't abbreviations) further...
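The steps above can be sketched in base R as follows. This is only an illustration under assumptions: parse_plain() and parse_with_abbreviations() are hypothetical names, the abbreviations are assumed to contain no regex metacharacters, and parse_plain() stands in for the real parsing step.

```r
# Hypothetical stand-in for the normal parsing of a non-abbreviation piece:
# split at underscores and at letter/digit boundaries.
parse_plain <- function(x) {
  x <- gsub("([A-Za-z])([0-9])", "\\1_\\2", x)
  x <- gsub("([0-9])([A-Za-z])", "\\1_\\2", x)
  parts <- strsplit(x, "_+")[[1]]
  parts[nzchar(parts)]
}

parse_with_abbreviations <- function(x, abbreviations = character(0)) {
  # 1. Check if abbreviations were supplied.
  if (length(abbreviations) == 0) return(parse_plain(x))

  # 2. Check for matches.
  m <- gregexpr(paste(abbreviations, collapse = "|"), x)
  if (m[[1]][1] == -1) return(parse_plain(x))

  # 3. Split into alternating non-match / match pieces. With invert = NA the
  #    result always starts and ends with a non-match, so even positions
  #    are the abbreviations.
  pieces <- regmatches(x, m, invert = NA)[[1]]
  is_abbr <- seq_along(pieces) %% 2 == 0

  # 4. Parse only the pieces that are not abbreviations.
  out <- lapply(seq_along(pieces), function(i) {
    if (is_abbr[i]) pieces[i] else parse_plain(pieces[i])
  })
  unlist(out)
}

with_abbr <- parse_with_abbreviations("t2d_status", "t2d")
without_abbr <- parse_with_abbreviations("t2d_status")
print(with_abbr)
print(without_abbr)
```

With the abbreviation supplied, "t2d" survives as one part; without it, the string is split at the digit as before.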

Tazinho commented 5 years ago

The above still sounds like significant overhead. Maybe the following could work:

replace spaces by "_" at the very beginning; this way spaces can be used to mark abbreviations
in order to mark the sides of each abbreviation, use the pattern " labbreviationr " (a space and an "l" before the abbreviation, an "r" and a space after it)
now for each parsing helper include a negative lookbehind (<not space followed by l, but it is ok if an r followed by space>) and a similar negative lookahead, so that only substrings outside of the abbreviation "scopes" are parsed
look into the current implementation of the numerals argument (here, too, some markers including spaces were used) and see if problems can be resolved...

Tazinho commented 5 years ago

Implemented in devversion-01 branch for now (almost as mentioned in the last post; not yet tested; also need to remove some overhead introduced by the current verbose implementation):

replace spaces by "_" at the very beginning; this way spaces can be used to mark abbreviations
in order to mark the sides of the abbreviations use the pattern " l labbreviationr r " (ensure that the pattern only occurs once for each abbreviation and correct wrong cases via gsub)
now split each string by the pattern "\sl|r\s". Then apply the parsing steps as one function inside an lapply. For each string use an ifelse to only parse those strings that don't start with "\sl".

Open steps:

look more closely into the markers for digits to ensure that these implementations don't collide.
write more tests
improve the speed (possibly one implementation without abbreviations, one with abbreviations that don't contain special characters/digits, one for any abbreviation)
- better just introduce logical subsetting and lapply once over only those cases that contain abbreviations and once (without lapply) over the other entries of string.
enable parsing of more abbreviations (currently only abbreviations in the form of "blaABBR" or "ABBRbla" ["ABBR"/"abbr" can contain/start/end with any combination of characters, but there must not be a switch from upper to lower case or vice versa] are parsed (and protected from other parsing options) correctly)
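The logical-subsetting idea from the speed item could be sketched like this in base R. All names are hypothetical, the abbreviations are assumed to contain no regex metacharacters, and insert_breaks() only stands in for the real parsing helpers: entries without abbreviations go through one vectorised call, and only the entries that contain an abbreviation are handled one by one.

```r
# Hypothetical vectorised parsing step: mark letter/digit boundaries with "_".
insert_breaks <- function(x) {
  x <- gsub("([A-Za-z])([0-9])", "\\1_\\2", x)
  gsub("([0-9])([A-Za-z])", "\\1_\\2", x)
}

parse_all <- function(strings, abbreviations = character(0)) {
  pattern <- paste(abbreviations, collapse = "|")
  has_abbr <- if (length(abbreviations) > 0) {
    grepl(pattern, strings)
  } else {
    rep(FALSE, length(strings))
  }
  out <- character(length(strings))

  # Fast path: one vectorised call for all entries without abbreviations.
  out[!has_abbr] <- insert_breaks(strings[!has_abbr])

  # Slow path: only entries with abbreviations are processed one at a time.
  out[has_abbr] <- vapply(strings[has_abbr], function(s) {
    pieces <- regmatches(s, gregexpr(pattern, s), invert = NA)[[1]]
    odd <- seq_along(pieces) %% 2 == 1   # odd positions are non-abbreviations
    pieces[odd] <- insert_breaks(pieces[odd])
    paste(pieces, collapse = "")
  }, character(1), USE.NAMES = FALSE)

  out
}

res <- parse_all(c("t2d_status", "area51_map"), abbreviations = "t2d")
print(res)
```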

Tazinho / snakecase

Numerals in abbreviations #155