adventuregamestudio / ags

AGS editor and engine source code
Other
708 stars 159 forks source link

ags4: add string trim (remove trailing and leading whitespace in string) #2397

Closed ericoporto closed 7 months ago

ericoporto commented 7 months ago

There is always a time where I need to use this when reading from a custom file, I have it done in AGS Script, but it really should be right there in the String itself, as it exists in Javascript, C# and all other languages I use.

An example of place where I would like to use this would be in something like dicttoini.

Edit: had to push force minor utf-8 proper fix.

ivan-mogilko commented 7 months ago

You may optimize this function for the case where there's no removed whitespace, and return same string pointer.

ericoporto commented 7 months ago

return same string pointer.

Wouldn't it not being a copy be something unexpected here?

ivan-mogilko commented 7 months ago

Wouldn't it not being a copy be something unexpected here?

No, as Strings are immutable and cannot be modified, making unchanged copies cannot serve any purpose.

Truncate does the same: it returns same string pointer if resulting length did not change.

I did same in recent Replace update. Strictly speaking, few other functions could do this too (like Substring), but probably no one bothered to handle such case.

ericoporto commented 7 months ago

Hey, is the way of going backwards that I am doing correct? I was having a hard time trying to understand the different lengths in the header?

If I understood this optimization requires checking that both the first and last character aren't space, but I am not sure if my approach for the backwards is correct.

ivan-mogilko commented 7 months ago

If I understood this optimization requires checking that both the first and last character aren't space, but I am not sure if my approach for the backwards is correct.

I don't think that's necessary, you already have a loop that stops if it's not a space. Simply comparing copylen with the old length should be enough.

EDIT: hmm, maybe the second loop looks a bit strange, I will double check.

ivan-mogilko commented 7 months ago

Oh I see now, you are selecting the last character by taking -1 from the string's length. That's not correct, because utf-8 character may be of multiple bytes. So you may end up in the middle of a unicode character.

I cannot find any ready function that goes one utf-8 character back... I was supposed to add one, but forgot. The example of how this is done may be found in couple of places. For example, this is backspace implementation in TextBox: https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/Common/gui/guitextbox.cpp#L87-L96

Following loop seeks for the start of the previous char:

for (; ptr > text.GetCStr() && ((*ptr & 0xC0) == 0x80); --ptr); 

EDIT: Unrelated, about string copying, there's String.Copy function for making explicit copy of a String, although I have no idea what is this function's purpose.

ericoporto commented 7 months ago

ok, from this I understand that the "back one char" would be like - where it takes a pointer to somewhere in the string (str_back) and backs 1 or don't if it's the same as the str[0] (str_front)

const char* back_one_char(const char* str_back, const char* str_front)
{
    for (; str_back > str_front && ((*str_back & 0xC0) == 0x80); --str_back);
    return str_back;
}

Should I hide it inside the function itself or do you think this could be added somewhere in unicode.c or somewhere else?

ivan-mogilko commented 7 months ago

There's utf8.h for utf8-related operations: https://github.com/adventuregamestudio/ags/blob/master/Common/util/utf8.h

But it depends on whether you intend a function strictly for utf-8 strings, or a universal "one char back" function.

In the first case you'd need to make a switch in code, checking if (get_uformat() == U_UTF8), and use either ascii way of ptr-1 or calling utf-8 function. (Probably best to have 2 separate loops in this case.)

In the latter case you'd have to replicate what allegro does with universal function pointers. Where it has a table of pointers: https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/libsrc/allegro/include/allegro/unicode.h#L52-L58 and these pointers are assigned to actual implementations using this table: https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/libsrc/allegro/src/unicode.c#L483-L484 and this function: https://github.com/adventuregamestudio/ags/blob/e86b84b08d85425ece8c589b4a3f05f3c0eb20e6/libsrc/allegro/src/unicode.c#L540 In which case you may add something like ugetx_reverse (or similar) for consistency.

ericoporto commented 7 months ago

actually the function I thought only works if it's not the null char, so you can't put in the end of the string and go back one char... Made a small change and put in the utf8.h instead of the allegro stuff.

ericoporto commented 7 months ago

oh, btw, if you want to merge 3.6.1 in master and master in ags4 before I can fix conflicts in my code here (from the String Replace, if git complains)

ivan-mogilko commented 7 months ago

This works with simple spaces (code 32), but it did so before.

I had to go to the unicode table and copy some special "space" character (0x2000) to test utf-8 properly. Latest commit causes infinite loop with the following string:

String s = "    this is a string    ";
Display("%s", s.Trim());

The reason is that when BackOneChar arrives at a single-byte char it does not do --c even once, and bails out. I think this function should assume that you passed a pointer next byte after the letter, and start with --c, like:

for (--c; c > front && ((*c & 0xC0) == 0x80); --c);

then you also don't need to test for null terminator.

EDIT: no, sorry, something is still wrong with this suggestion.

ericoporto commented 7 months ago

Thanks! I adjusted it! Edit: push forced a few safety checks.

In the current master, in string replace the mingw error appears to require that static_cast to be a reinterpret_cast<fn_strstr>(strstr). I was actually able to build locally with the mingw I have in CLion but not in the one in MSYS2 for whatever reason - not sure yet.

ivan-mogilko commented 7 months ago

I found another mistake; at first I thought it's inside BackOneChar, but it's where you calculate copylen:

size_t copylen = nonSpaceBack - nonSpaceFront + 1;

it must be

size_t copylen = nonSpaceBack - nonSpaceFront + ucwidth(c);

where c is the last met non-space character. That's because nonSpaceBack ends up before that non-space character, not after.

But, alternatively, it may be more convenient to save a pointer before advancing back next char, and use that here.

ivan-mogilko commented 7 months ago

I'm using following script as a test:

        String s = "    latin string    ";
    Display("%s", s.Trim());
    s = "    latin string    ";
    Display("%s", s.Trim());
    s = "    こんにちは    ";
    Display("%s", s.Trim());
    s = "    こんにちは    ";
    Display("%s", s.Trim());

Note that these spaces contain non-trivial spaces too (0x2000), except you cannot see that right here...

ericoporto commented 7 months ago

ok, it looks like this version works! This was the change. Thanks for figuring it out. Edit: did a push force because I rebased to the master merge to ags4.

ivan-mogilko commented 7 months ago

In the current master, in string replace the mingw error appears to require that static_cast to be a reinterpret_cast<fn_strstr>(strstr). I was actually able to build locally with the mingw I have in CLion but not in the one in MSYS2 for whatever reason - not sure yet.

I could not find a fix for this yet, if I change to reinterpret_cast, then it fails almost everywhere else, because compiler cannot deduce which function overload to take. I will look into this again tomorrow.

ericoporto commented 7 months ago

One possible thing to try tomorrow is using strstr from the equivalent c++ header (std::strstr), usually those have more consistent types across implementations.

ericoporto commented 7 months ago

Other possibility is to have our own strstr that wraps strstr so we have a way to ensure its calling convention. Picked the idea from here. The other possibility is to move the call to it's own function that takes a bool if it's case sensitive, avoiding the need for a function pointer.

ivan-mogilko commented 7 months ago

Small note, I simplified StepCharBack to not have a separate if check: https://github.com/adventuregamestudio/ags/commit/918a257fcd384105dc6a85f9c9ac36030b606228

Added small automatic test for this function: https://github.com/adventuregamestudio/ags/commit/2d994922d88c22a7d265a0a6388abddf2d8696a1

I'll backport this function to master and replace instances met in code.