fubark / cyber

Fast and concurrent scripting.
https://cyberscript.dev

Indexing strings uses code points rather than grapheme clusters #10

Open ifreund opened 1 year ago

ifreund commented 1 year ago

Example code:

i = '👨‍👨‍👦‍👦c'.indexChar('c')
print 'Found char at {i}.' -- Found char at 7.

Indexing strings by code point doesn't really make sense with regard to how Unicode is actually rendered, and I think it would be a shame to bake these semantics into the language. Yes, this is how Python and many other languages have built their string APIs. However, some languages such as Swift have taken a different approach that makes doing the wrong thing in the presence of Unicode a lot harder. See this blog post for an overview: https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
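To make the difference concrete, here is a rough Python sketch (not Cyber code) that locates 'c' in the same string under three different units. The helper `rough_clusters` is hypothetical and deliberately simplified: it only handles zero width joiner sequences and combining marks, and it is not a full implementation of the Unicode segmentation rules.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner, glues several emoji into one rendered glyph

# The family emoji from the example (man, man, boy, boy joined by three ZWJs), then 'c'.
s = "\U0001F468\u200d\U0001F468\u200d\U0001F466\u200d\U0001F466c"


def rough_clusters(text):
    """Hypothetical, simplified segmentation; NOT the full Unicode algorithm.

    A code point is attached to the previous cluster if it is a ZWJ,
    if it follows a ZWJ, or if it is a combining mark.
    """
    clusters = []
    for ch in text:
        attach = clusters and (
            ch == ZWJ
            or clusters[-1].endswith(ZWJ)
            or unicodedata.combining(ch) != 0
        )
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


code_point_index = s.index("c")                   # counts code points
utf8_byte_index = s.encode("utf-8").index(b"c")   # counts UTF-8 bytes
cluster_index = rough_clusters(s).index("c")      # counts (approximate) grapheme clusters

print(code_point_index, utf8_byte_index, cluster_index)  # 7 25 1
```

A reader sees two "characters" on screen, so only the last number matches what is rendered. A real implementation would need the complete grapheme cluster rules rather than this shortcut.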

I'm not trying to tell you how best to design your language or anything like that, I just want to ensure you are aware of the tradeoff here.

fubark commented 1 year ago

I agree, strings need to be redesigned. First of all, it doesn't make sense to call each code point a character when there are grapheme clusters. Second, making string ops too easy can have a large impact on performance without the user knowing what is happening. So a tradeoff needs to be made.

One idea is that internally the string types still have this ASCII and non-ASCII distinction. In the ASCII case, you get all these fast ops. In the non-ASCII case, it behaves more like Swift, where only basic ops are allowed. For additional ops you would need to create a new view over it.
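As a rough illustration of that idea, here is a Python sketch, not Cyber's actual design. The names `Text`, `CodePointView`, `index_char`, and `code_points` are all hypothetical. ASCII content gets the direct indexing op, while non-ASCII content rejects it until the caller explicitly picks a view, which makes the unit being counted visible at the call site.

```python
class CodePointView:
    """Hypothetical view that indexes by code point, chosen explicitly by the caller."""

    def __init__(self, value: str):
        self._value = value

    def index_char(self, ch: str) -> int:
        return self._value.index(ch)


class Text:
    """Hypothetical string type with an internal ASCII / non-ASCII distinction."""

    def __init__(self, value: str):
        self._value = value
        self._is_ascii = value.isascii()  # decided once, at construction

    def index_char(self, ch: str) -> int:
        # Fast path: for ASCII, byte, code point, and grapheme indexes all agree.
        if not self._is_ascii:
            raise TypeError("non-ASCII text: create a view first, e.g. code_points()")
        return self._value.index(ch)

    def code_points(self) -> CodePointView:
        # Additional ops require an explicit view over the same content.
        return CodePointView(self._value)


family = "\U0001F468\u200d\U0001F468\u200d\U0001F466\u200d\U0001F466"

print(Text("abc").index_char("c"))                        # 2
print(Text(family + "c").code_points().index_char("c"))   # 7
```

Calling `Text(family + "c").index_char("c")` directly raises `TypeError` in this sketch, so the code point semantics can no longer be picked up by accident. A grapheme cluster view could be added alongside `code_points()` in the same way.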