Closed liz3 closed 2 weeks ago
This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!
Should String.len() return byte length or character length
I would expect this to return the character length not the byte length. I like the byteLen
addition! I'm more than happy with adding some extra test cases!
We will need to make sure this also works with the stdlib modules too, quite a big change!!
This looks awesome and something definitely needed, thank you for this!! Sorry I've been so slow on this, been a bit hectic for me as of late!
Should String.len() return byte length or character length
I would expect this to return the character length not the byte length. I like the
byteLen
addition! I'm more than happy with adding some extra test cases!We will need to make sure this also works with the stdlib modules too, quite a big change!!
Hey, ive continued on this. I think the changes are needed, but i want to make clear it introduces a overhead into indexing/slicing and allocation of strings since simple indexing does not work anymore and is iterative now, further when allocating the character length is computed, here we could also compute it only when required but since the length is commonly used i opted to compute it upfront One entirely opposite approach is to store the strings as unicode in memory, this would mean a massive increase in memory usage but retain O(1) indexing capabilities and so on, im not entirely sure what the better tradeoff is here but the more computational expensive seamed more reasonable since a lot of strings are ascii.
Ive went with approaching that most string functions will throw errors when provided with invalid utf-8, i do not know how the library would behave when given invalid utf-8. Exceptions are indexing and slicing which will fallback to the byte handling. It might also be worth adding a function which lets the user know if the string is valid utf-8 maybe?
The computing for indexing etc is a tradeoff we will just have to accept unfortunately for support of UTF 8 (and also the route I would have went down personally) so I'm happy with that!
Yeah it's a valid concern and something we could potentially add to the stdlib, I have a feeling the chances of invalid utf 8 actually being used would be pretty low though so something we can potentially think about in the future if needs be.
Again, appreciate all the work you've done on this!
I am not entirely sure why the windows tests fail, there does not seam to be a clear test which is failing Sadly at the moment i don't have access to a Window system.
@Jason2605 I believe this is good for review / merge now. I have fixed the issue with the windows builds, which turned out to be a error on my side but it took a good while to find. I see that theres a macos failure but i have run the tests on my mb and they all pass.
Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.
@briandowns be good to get your thoughts on this PR too!
Nice one thank you! I don't have an ARM mac myself to exactly test unfortunately but will have a see if I can track someone down that does.
@briandowns be good to get your thoughts on this PR too!
i have a arm mac, infact i only have apple sillicon machines
The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night
The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night
Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it
The way runners fail also make it hard to see the error. The log does not really give you a hint or so to start debugging, but i ran all the tests on m2 air last night
Oh interesting I see. Yeah the logs are no help at all really 😂 Usually when it's just a dump on code 1 it's a segfault somewhere, so will need to see if we can track it
Aight maybe then a adresssanitizer can help to find it if its a heap corruption, Il try that
@Jason2605 was able to repro with a debug build and fixed it: https://github.com/dictu-lang/Dictu/pull/749/commits/f8d481843ba1c1529f572b8dc378f58dbd17b053
okay that's satisfying to be entirely honest!
Incredible stuff, this is so so so nice!!!!!
@Jason2605 i ran clang-format over the file in order to fix indentation
Noice yeah I should probably add something to the GH actions for that!
Very much appreciated, once again thank you for this!
Add Unicode support to strings
Resolves: https://github.com/dictu-lang/Dictu/issues/317
What's Changed:
This updates the implementation of the string API to support unicode using the utf8.h library.
Theres still a few incomplete things, some fuctions do not need unicode versions(strip functions). But on the other side there are things where the library does not provide full functionality which is needed for
isUpper
/isLower
for example which do not wrong, forcing the usage of the c std lib apis.Should
String.len()
return byte length or character length, what if the string has invalid utf8? calculating the length upfront for large strings, always might be expensive.And not enough tests are added.
Type of Change:
#
Housekeeping:
#
Screenshots (If Applicable):