Preprocessor gets infinite loop meeting a non-latin unicode char in script

ivan-mogilko commented 4 months ago

Problem If the game project is configured into Unicode mode, and there's a non-latin character met in plain script (except comments or character string literals in single or double quotes respectively), then the script preprocessor would stuck in an infinite loop while trying to process the scripts.

Steps to reproduce

Create a project in Unicode mode.
Open a script and place a random non-latin char on a new line.
Run game build.
Observe that the process is stuck at the script compilation forever.

Expected behavior I suppose it's expected that preprocessor should produce a warning or error about invalid character found in script.

Additional That would be a good idea to also test compilers with the same case, as both old and new one were written back when scripts were strictly ASCII.

On a side note, we might plan adding a "Abort/Cancel" button to a BusyDialog that runs a working process. It's weird how this was not featured from the start.

ericoporto commented 4 months ago

I think the place to put the check that would give the error message would be here:

https://github.com/adventuregamestudio/ags/blob/d790aeabd193168f2ee9cfad576b82459978ed5f/Editor/AGS.CScript.Compiler/Preprocessor.cs#L390

Not sure how it would break from here. Does any character puts it in infinite loop or is there one specific that you verified?

Writing a new test for this would be a good way to debug this

https://github.com/adventuregamestudio/ags/blob/master/Editor/AGS.Editor.Tests/ScriptCompiler/PreprocessorTests.cs

Then it's a matter of adjusting the preprocessor to produce the desired behavior in the test.

ivan-mogilko commented 4 months ago

Does any character puts it in infinite loop or is there one specific that you verified?

Multiple characters were reported by different users, and i tried with a random one, so I assume that any character which unicode index exceeds either 127, or 255 (to make it safe).

ericoporto commented 4 months ago

OK, I understand why the infinite loop, it happens because of GetNextWord, instead of it going until the next break (edit: the hard part is a lot of things can be a break though, not just spaces apparently) or dot, to retrieve the word, it uses the valid ascii characters as check and breaks on everything else.

I think it's best that GetNextWord just use breaks and then we can have an additional function that does the required validation. Edit: but it's apparently super hard to do this in the code, or at least it breaks a ton of things when I try.

Edit: made a way to workaround with the following function, that essentially try to get what is the next non-scriptWordChar and then if it's not something that would be valid there it throws an error. There is a bit of dancing around to set the text that is passed as ref because otherwise we get in inifinite loop.

private string GetNextWord(ref string text, bool trimText, bool includeDots)
{
    int i = 0;
    while ((i < text.Length) && 
                 (text[i].IsScriptWordChar() ||
                  (includeDots && (text[i] == '.')))
                 )
    {
        i++;
    }
    // added this
    if(i < text.Length && !(char.IsWhiteSpace(text[i]) || text[i] == ';' || text[i] == ',' || text[i] == '(' || text[i] == ')' || (includeDots && (text[i] == '.'))))
        {
        RecordError(ErrorCode.InvalidCharacter, "Invalid character detected at position " + i.ToString() + " in '" + text + "'");
        string res = text;
        text = string.Empty; // need to end line
        return res;
    }
    // end of what was added

    string word = text.Substring(0, i);
    text = text.Substring(i);
    if (trimText)
    {
        text = text.Trim();
    }
    return word;
}

Btw, I am not sure if char.IsWhiteSpace is a correct function because it says it assumer UTF-16 and I can't tell if the string is utf-8 or utf-16...

Added the following tests

```C# [Test] public void NonLatinUnicodeComment() { // Non-latin unicode characters in comments raises no error, as comments are removed IPreprocessor preprocessor = CompilerFactory.CreatePreprocessor(AGS.Types.Version.AGS_EDITOR_VERSION); string script = $@" // this is a comment // this is a comment with invalid latin characters Иह€한𐍈 <- here // 1234 //#define invalid 5 // invalid int i; "; string res = preprocessor.Preprocess(script, "ScriptName"); Assert.That(preprocessor.Results.Count == 0); string script_res = $@"""__NEWSCRIPTSTART_ScriptName"" int i; "; AssertStringEqual(res, script_res); } [Test] public void NonLatinUnicodeInString() { // Non-latin unicode characters in strings raises no error, as they aren't parsed here IPreprocessor preprocessor = CompilerFactory.CreatePreprocessor(AGS.Types.Version.AGS_EDITOR_VERSION); string script = $@" // this is a comment // the string below has invalid latin characters string st = ""aИह€한𐍈""; //#define invalid 5 // invalid int i; "; string res = preprocessor.Preprocess(script, "ScriptName"); Assert.That(preprocessor.Results.Count == 0); string script_res = $@"""__NEWSCRIPTSTART_ScriptName"" string st = ""aИह€한𐍈""; int i; "; AssertStringEqual(res, script_res); } [Test] public void NonLatinUnicodeInScript() { // Non-latin unicode characters in script vars raises error, as they aren't allowed later on the compiler IPreprocessor preprocessor = CompilerFactory.CreatePreprocessor(AGS.Types.Version.AGS_EDITOR_VERSION); string script = $@" // the variable below has invalid latin characters int aИह€한𐍈 = 15; int i; "; preprocessor.Preprocess(script, "ScriptName"); Assert.That(preprocessor.Results.Count, Is.EqualTo(1)); Assert.That(preprocessor.Results[0].Code, Is.EqualTo(ErrorCode.InvalidCharacter)); } ```

adventuregamestudio / ags

Preprocessor gets infinite loop meeting a non-latin unicode char in script #2480