Develop (and implement) a more general concept for abbreviations

Tazinho commented 5 years ago

Abbreviations can be completely arbitrary. They can include any combination of upper/lower case letters, digits or special symbols.

However, when separating abbreviations from strings, it is important not to cut pieces out of words that were not intended as abbreviations.

Therefore, abbreviations should only be cut out of words when there is enough evidence, in the form of word breaks, that they were really meant as abbreviations.

For this reason we try to be as specific as possible. Taken together the rules are a bit complex, but they are rational, and the fact that we only need to distinguish between a fixed set of possibilities keeps the complexity manageable.

In general we need to consider the first and the last symbol of an abbreviation independently. Each can be a letter (lower or upper case), a digit, or a non-alphanumeric character.

Further, when the abbreviation sits at the start of a word, we don't need to check its first character (and likewise its last character when it sits at the end of a word).

Let's start with the easy ones.

non-alphanumeric character: In this case we don't need to worry about the characters next to the abbreviation, since in typical language no word consists of consecutive non-alphanumeric characters.

digits: In the case of a digit it is only important that the directly surrounding characters are not digits as well.

small letters: In the case of a small letter it is only important that the directly surrounding characters are not small letters as well.

big letters: In the case of big letters it gets harder, and we need to distinguish several cases. If no uppercase letter surrounds the abbreviation, we are fine. If an uppercase letter comes directly next to the abbreviation, this is only OK if that letter is followed by a lowercase letter (for example, the "N" in "IDNumber" starts a new word, so "ID" can be separated).
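
The four cases above can be sketched as a small check in base R. This is only an illustration of the proposed rules, not the snakecase implementation: the helper names `char_class()`, `edge_ok()` and `has_break_evidence()` are made up here, the match is assumed to be given as start and end positions, and only ASCII letters and digits are classified. One interpretation is also assumed: an uppercase neighbour on the left is always "followed by" the abbreviation's own first character, so an uppercase abbreviation directly preceded by an uppercase letter is rejected.

```r
# Classify one character; "" means there is no neighbour (string start or end).
char_class <- function(ch) {
  if (ch == "") "none"
  else if (ch %in% as.character(0:9)) "digit"
  else if (ch %in% letters) "lower"
  else if (ch %in% LETTERS) "upper"
  else "other"
}

# edge:      first or last character of the abbreviation
# neighbour: character directly next to that edge, outside the abbreviation
# follower:  character that comes directly after the neighbour in the string
edge_ok <- function(edge, neighbour, follower) {
  e <- char_class(edge)
  n <- char_class(neighbour)
  if (n == "none" || e == "other") return(TRUE)  # word start/end, or symbol edge
  if (e == "digit") return(n != "digit")
  if (e == "lower") return(n != "lower")
  # Uppercase edge: fine unless the neighbour is uppercase too; in that case
  # the neighbour must be followed by a lowercase letter (assumed reading).
  if (n != "upper") return(TRUE)
  char_class(follower) == "lower"
}

# Is there enough word-break evidence to cut string[start:end] out as an abbreviation?
has_break_evidence <- function(string, start, end) {
  left <- edge_ok(substr(string, start, start),
                  substr(string, start - 1, start - 1),
                  substr(string, start, start))
  right <- edge_ok(substr(string, end, end),
                   substr(string, end + 1, end + 1),
                   substr(string, end + 2, end + 2))
  left && right
}

has_break_evidence("identicalID", 1, 2)    # FALSE: "id" is followed by lowercase "e"
has_break_evidence("identicalID", 10, 11)  # TRUE:  "ID" follows lowercase "l" and ends the string
has_break_evidence("IDNumber", 1, 2)       # TRUE:  "N" is followed by lowercase "u"
has_break_evidence("IDNO", 1, 2)           # FALSE: "N" is followed by uppercase "O"
```

Checking the two edges independently mirrors the idea that the first and last symbol of an abbreviation are considered separately.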

Tazinho commented 5 years ago

In general it might be good to uppercase abbreviations after the conversion step for the cases lower_camel, upper_camel, mixed and title. For title case in particular, one enhancement would be to build the result from snake case instead of parsed case.
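
A minimal sketch of that idea for upper_camel, assuming the string has already been converted to snake case and that abbreviations are compared case-insensitively against whole parts. The function name `upper_camel_with_abbr()` is hypothetical and not part of snakecase:

```r
# Build upper camel case from a snake_case string, uppercasing every part
# that matches one of the given abbreviations (hypothetical helper).
upper_camel_with_abbr <- function(snake, abbreviations) {
  parts <- strsplit(snake, "_", fixed = TRUE)[[1]]
  is_abbr <- parts %in% tolower(abbreviations)
  # Default treatment: capitalise only the first letter of each part.
  capitalised <- paste0(toupper(substring(parts, 1, 1)), substring(parts, 2))
  paste(ifelse(is_abbr, toupper(parts), capitalised), collapse = "")
}

upper_camel_with_abbr("user_id_number", "id")  # "UserIDNumber"
```

Because the abbreviation is only restored after the split on underscores, it can never be cut out of the middle of a longer word at this stage.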

Tazinho commented 5 years ago

This should really be resolved and further clarified, as the current implementation is also error-prone, e.g.:

> to_any_case("identicalID", abbreviations = "id")
[1] "id_entical_id"

Tazinho / snakecase

Develop (and implement) a more general concept for abbreviations #165