liz3 commented 2 months ago

Add Unicode support to strings

Resolves: https://github.com/dictu-lang/Dictu/issues/317

What's Changed:

This updates the implementation of the string API to support unicode using the utf8.h library.

Theres still a few incomplete things, some fuctions do not need unicode versions(strip functions). But on the other side there are things where the library does not provide full functionality which is needed for isUpper/isLower for example which do not wrong, forcing the usage of the c std lib apis.

Should String.len() return byte length or character length, what if the string has invalid utf8? calculating the length upfront for large strings, always might be expensive.

And not enough tests are added.

Type of Change:

[ ] Bug fix
[X] New feature
[X] Breaking change (fix or feature that would cause existing functionality to not work as expected)

#

Housekeeping:

[x] Tests have been updated to reflect the changes done within this PR (if applicable).
[ ] Documentation has been updated to reflect the changes done within this PR (if applicable).

#

Screenshots (If Applicable):

Jason2605 commented 1 month ago

This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!

Should String.len() return byte length or character length

I would expect this to return the character length not the byte length. I like the byteLen addition! I'm more than happy with adding some extra test cases!

We will need to make sure this also works with the stdlib modules too, quite a big change!!

liz3 commented 1 month ago

This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!

Should String.len() return byte length or character length

I would expect this to return the character length not the byte length. I like the byteLen addition! I'm more than happy with adding some extra test cases!

We will need to make sure this also works with the stdlib modules too, quite a big change!!

Hey, ive continued on this. I think the changes are needed, but i want to make clear it introduces a overhead into indexing/slicing and allocation of strings since simple indexing does not work anymore and is iterative now, further when allocating the character length is computed, here we could also compute it only when required but since the length is commonly used i opted to compute it upfront One entirely opposite approach is to store the strings as unicode in memory, this would mean a massive increase in memory usage but retain O(1) indexing capabilities and so on, im not entirely sure what the better tradeoff is here but the more computational expensive seamed more reasonable since a lot of strings are ascii.

Ive went with approaching that most string functions will throw errors when provided with invalid utf-8, i do not know how the library would behave when given invalid utf-8. Exceptions are indexing and slicing which will fallback to the byte handling. It might also be worth adding a function which lets the user know if the string is valid utf-8 maybe?

Jason2605 commented 1 month ago

The computing for indexing etc is a tradeoff we will just have to accept unfortunately for support of UTF 8 (and also the route I would have went down personally) so I'm happy with that!

Yeah it's a valid concern and something we could potentially add to the stdlib, I have a feeling the chances of invalid utf 8 actually being used would be pretty low though so something we can potentially think about in the future if needs be.

Again, appreciate all the work you've done on this!

liz3 commented 4 weeks ago

I am not entirely sure why the windows tests fail, there does not seam to be a clear test which is failing Sadly at the moment i don't have access to a Window system.

liz3 commented 3 weeks ago

@Jason2605 I believe this is good for review / merge now. I have fixed the issue with the windows builds, which turned out to be a error on my side but it took a good while to find. I see that theres a macos failure but i have run the tests on my mb and they all pass.

Jason2605 commented 3 weeks ago

Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.

@briandowns be good to get your thoughts on this PR too!

liz3 commented 3 weeks ago

Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.

@briandowns be good to get your thoughts on this PR too!

i have a arm mac, infact i only have apple sillicon machines

liz3 commented 3 weeks ago

The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night

Jason2605 commented 3 weeks ago

The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night

Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it

liz3 commented 3 weeks ago

The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night

Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it

Aight maybe then a adresssanitizer can help to find it if its a heap corruption, Il try that

liz3 commented 2 weeks ago

@Jason2605 was able to repro with a debug build and fixed it: https://github.com/dictu-lang/Dictu/pull/749/commits/f8d481843ba1c1529f572b8dc378f58dbd17b053

liz3 commented 2 weeks ago

okay that's satisfying to be entirely honest!

Jason2605 commented 2 weeks ago

Incredible stuff, this is so so so nice!!!!!

liz3 commented 2 weeks ago

@Jason2605 i ran clang-format over the file in order to fix indentation

Jason2605 commented 2 weeks ago

Noice yeah I should probably add something to the GH actions for that!

Very much appreciated, once again thank you for this!

dictu-lang / Dictu

utf-8 support in strings #749

Add Unicode support to strings

What's Changed:

Type of Change:

Housekeeping:

Screenshots (If Applicable):