Closed ericoporto closed 7 months ago
You may optimize this function for the case where there's no removed whitespace, and return same string pointer.
return same string pointer.
Wouldn't it not being a copy be something unexpected here?
Wouldn't it not being a copy be something unexpected here?
No, as Strings are immutable and cannot be modified, making unchanged copies cannot serve any purpose.
Truncate does the same: it returns same string pointer if resulting length did not change.
I did same in recent Replace update. Strictly speaking, few other functions could do this too (like Substring), but probably no one bothered to handle such case.
Hey, is the way of going backwards that I am doing correct? I was having a hard time trying to understand the different lengths in the header?
If I understood this optimization requires checking that both the first and last character aren't space, but I am not sure if my approach for the backwards is correct.
If I understood this optimization requires checking that both the first and last character aren't space, but I am not sure if my approach for the backwards is correct.
I don't think that's necessary, you already have a loop that stops if it's not a space.
Simply comparing copylen
with the old length should be enough.
EDIT: hmm, maybe the second loop looks a bit strange, I will double check.
Oh I see now, you are selecting the last character by taking -1 from the string's length. That's not correct, because utf-8 character may be of multiple bytes. So you may end up in the middle of a unicode character.
I cannot find any ready function that goes one utf-8 character back... I was supposed to add one, but forgot. The example of how this is done may be found in couple of places. For example, this is backspace implementation in TextBox: https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/Common/gui/guitextbox.cpp#L87-L96
Following loop seeks for the start of the previous char:
for (; ptr > text.GetCStr() && ((*ptr & 0xC0) == 0x80); --ptr);
EDIT: Unrelated, about string copying, there's String.Copy function for making explicit copy of a String, although I have no idea what is this function's purpose.
ok, from this I understand that the "back one char" would be like - where it takes a pointer to somewhere in the string (str_back
) and backs 1 or don't if it's the same as the str[0]
(str_front
)
const char* back_one_char(const char* str_back, const char* str_front)
{
for (; str_back > str_front && ((*str_back & 0xC0) == 0x80); --str_back);
return str_back;
}
Should I hide it inside the function itself or do you think this could be added somewhere in unicode.c
or somewhere else?
There's utf8.h for utf8-related operations: https://github.com/adventuregamestudio/ags/blob/master/Common/util/utf8.h
But it depends on whether you intend a function strictly for utf-8 strings, or a universal "one char back" function.
In the first case you'd need to make a switch in code, checking if (get_uformat() == U_UTF8)
, and use either ascii way of ptr-1
or calling utf-8 function. (Probably best to have 2 separate loops in this case.)
In the latter case you'd have to replicate what allegro does with universal function pointers. Where it has a table of pointers:
https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/libsrc/allegro/include/allegro/unicode.h#L52-L58
and these pointers are assigned to actual implementations using this table:
https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/libsrc/allegro/src/unicode.c#L483-L484
and this function:
https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/libsrc/allegro/src/unicode.c#L540
In which case you may add something like ugetx_reverse
(or similar) for consistency.
actually the function I thought only works if it's not the null char, so you can't put in the end of the string and go back one char... Made a small change and put in the utf8.h
instead of the allegro stuff.
oh, btw, if you want to merge 3.6.1 in master and master in ags4 before I can fix conflicts in my code here (from the String Replace, if git complains)
This works with simple spaces (code 32), but it did so before.
I had to go to the unicode table and copy some special "space" character (0x2000) to test utf-8 properly. Latest commit causes infinite loop with the following string:
String s = " this is a string ";
Display("%s", s.Trim());
The reason is that when BackOneChar arrives at a single-byte char it does not do --c
even once, and bails out.
I think this function should assume that you passed a pointer next byte after the letter, and start with --c
, like:
for (--c; c > front && ((*c & 0xC0) == 0x80); --c);
then you also don't need to test for null terminator.
EDIT: no, sorry, something is still wrong with this suggestion.
Thanks! I adjusted it! Edit: push forced a few safety checks.
In the current master, in string replace the mingw error appears to require that static_cast to be a reinterpret_cast<fn_strstr>(strstr)
. I was actually able to build locally with the mingw I have in CLion but not in the one in MSYS2 for whatever reason - not sure yet.
I found another mistake; at first I thought it's inside BackOneChar, but it's where you calculate copylen:
size_t copylen = nonSpaceBack - nonSpaceFront + 1;
it must be
size_t copylen = nonSpaceBack - nonSpaceFront + ucwidth(c);
where c
is the last met non-space character.
That's because nonSpaceBack ends up before that non-space character, not after.
But, alternatively, it may be more convenient to save a pointer before advancing back next char, and use that here.
I'm using following script as a test:
String s = " latin string ";
Display("%s", s.Trim());
s = " latin string ";
Display("%s", s.Trim());
s = " こんにちは ";
Display("%s", s.Trim());
s = " こんにちは ";
Display("%s", s.Trim());
Note that these spaces contain non-trivial spaces too (0x2000), except you cannot see that right here...
ok, it looks like this version works! This was the change. Thanks for figuring it out. Edit: did a push force because I rebased to the master merge to ags4.
In the current master, in string replace the mingw error appears to require that static_cast to be a
reinterpret_cast<fn_strstr>(strstr)
. I was actually able to build locally with the mingw I have in CLion but not in the one in MSYS2 for whatever reason - not sure yet.
I could not find a fix for this yet, if I change to reinterpret_cast
, then it fails almost everywhere else, because compiler cannot deduce which function overload to take. I will look into this again tomorrow.
One possible thing to try tomorrow is using strstr from the equivalent c++ header (std::strstr
), usually those have more consistent types across implementations.
Other possibility is to have our own strstr that wraps strstr so we have a way to ensure its calling convention. Picked the idea from here. The other possibility is to move the call to it's own function that takes a bool if it's case sensitive, avoiding the need for a function pointer.
Small note, I simplified StepCharBack to not have a separate if
check:
https://github.com/adventuregamestudio/ags/commit/918a257fcd384105dc6a85f9c9ac36030b606228
Added small automatic test for this function: https://github.com/adventuregamestudio/ags/commit/2d994922d88c22a7d265a0a6388abddf2d8696a1
I'll backport this function to master and replace instances met in code.
There is always a time where I need to use this when reading from a custom file, I have it done in AGS Script, but it really should be right there in the String itself, as it exists in Javascript, C# and all other languages I use.
An example of place where I would like to use this would be in something like dicttoini.
Edit: had to push force minor utf-8 proper fix.