bbkr closed this 6 months ago
Regexes have not only samemark but also ignoremark:
https://docs.raku.org/syntax/%3Aignoremark
In case we wanted to follow that naming convention. (I realize it's not a 100% match.)
@coke: I think this is an early design inconsistency, probably dating back to the Apocalypses.
samemark is an adjective (which?), while ignoremark is a noun (what to do?); to be consistent it should be ignoreDmark.
That's why I'm not a big fan of copying this mixed naming. I was also thinking about the name basemark.
https://github.com/rakudo/rakudo/pull/5562 contains an implementation.
Stripping accents to get base characters is a very common operation in text indexing and searching. Raku currently allows it through the samemark method.
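For reference, here is a minimal sketch of how samemark behaves, using the examples given in the Raku documentation for Str.samemark. Passing a single unmarked character such as 'a' as the pattern is assumed here to be the stripping idiom in question; any unmarked character should work the same way.

```raku
# Marks are copied position by position from the pattern.
say 'åäö'.samemark('aäo');   # OUTPUT: «aäo␤»

# The pattern is shorter than the string, so the (absent) mark of its
# last character 'a' is reused for the rest: every accent is stripped.
say 'åäö'.samemark('a');     # OUTPUT: «aao␤»

# An empty pattern leaves the string unchanged.
say 'åäö'.samemark('');      # OUTPUT: «åäö␤»
```

The second call is the one that strips accents, and it relies entirely on the "reuse the last pattern character" rule.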
There are two issues with that method:
- The "remaining characters in $string receive the same mark/accent as the last character in $pattern" behavior is confusing and just looks weird. It is an arbitrarily chosen, magic "repeat last character" WAT.
- The method has to go through $pattern and extract the decomposed grapheme codepoints of its last character, with all the hassle of monitoring $pattern length at the same time.
My proposal is to add an easy-to-use, optimized method for stripping accents, one that preferably avoids reallocating the whole string (splitting and joining) when there are no marks to strip, because this method will usually be called blindly on any given input, with or without marks. As Raku is more commonly used with LLMs, this may be a good addition to the language.