Add `nomark` for stripping accents; `samemark` is counterintuitive and slow for that purpose.

bbkr commented 6 months ago

Stripping accents to get base characters is a very common operation in text indexing and searching. Raku allows this through:

"mówić".samemark( "a" )

There are two issues with that method:

1. For someone who does not know the documented behavior ("the remaining characters in $string receive the same mark/accent as the last character in $pattern"), it is confusing and just looks weird. It's an arbitrarily chosen, magic "repeat last character" WAT behavior.
2. The underlying implementation takes unnecessary steps because it must parse $pattern and extract the decomposed grapheme codepoints of its last character, while also keeping track of the length of $pattern.
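
For illustration, the pattern-extension rule from the first point can be seen in a short sketch. The strings are taken from the examples in the `samemark` documentation, written here in method form.

```raku
# Each character of the string takes the mark of the character at the
# same position in the pattern.
say 'åäö'.samemark('aäo');   # aäo

# The pattern is shorter than the string, so its last character ("a",
# which carries no mark) is reused for every remaining character.
say 'åäö'.samemark('a');     # aao
say 'Pêrl'.samemark('a');    # Perl
```

Stripping accents therefore works only as a side effect of passing a one-character, unaccented pattern.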

My proposal is to add an easy-to-use, optimized method for stripping accents:

say "mówić".nomark()   # mowic

Preferably, it would avoid reallocating the whole string (splitting and joining) when there are no marks to strip, because this method will usually be called blindly on any given input, with or without marks. As Raku is more commonly used with LLMs, this may be a good addition to the language.
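
Until such a method exists, the blind-call use case can be approximated in user code. This is only a sketch: `strip-marks` is a hypothetical helper name, not part of Raku, and it simply delegates to the existing `samemark`, so it does not provide the optimization proposed above.

```raku
# Hypothetical helper: a thin wrapper over the existing samemark method.
sub strip-marks(Str:D $text --> Str:D) {
    $text.samemark('a')
}

# Called blindly on input with and without marks.
say strip-marks($_) for 'mówić', 'hello', 'Pêrl';
# mowic
# hello
# Perl
```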

coke commented 6 months ago

Regexes have not only samemark but also ignoremark

https://docs.raku.org/syntax/%3Aignoremark

In case we wanted to follow that naming convention. (I realize it's not a 100% match)
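
As a rough sketch of what the regex adverb does, compared with a plain match:

```raku
# :ignoremark (short form :m) compares base characters only.
say so 'mówić' ~~ m:ignoremark/mowic/;   # True

# Without the adverb, accented and unaccented characters differ.
say so 'mówić' ~~ /mowic/;               # False
```

Note that the adverb only affects matching; it does not return a string with the marks removed.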

bbkr commented 6 months ago

@coke: I think this is an early design inconsistency, probably dating back to the Apocalypses.

samemark is an adjective (which?)
ignoremark is a verb (what to do?); it should be ignoreDmark

That's why I'm not a big fan of copying this mixed naming. I was also thinking about the name basemark.

lizmat commented 6 months ago

https://github.com/rakudo/rakudo/pull/5562 contains an implementation.

Raku / problem-solving

Add `nomark` for stripping accents; `samemark` is counterintuitive and slow for that purpose. #427