WeaselGames / godot_luaAPI

Godot LuaAPI
https://luaapi.weaselgames.info

Bug Report: UTF-8 character parsing error #177

Closed: upizpp closed this issue 9 months ago

upizpp commented 9 months ago

Problem description: If the Lua code contains any characters outside the ASCII set, i.e. UTF-8 encoded characters (whether in a string or as an identifier), the result is either not what was expected or an error is reported.

Code snippet that caused the error:

var api := LuaAPI.new()
api.do_string("""
function say()
print("你好")
end
function get()
return "世界"
end
""")
api.call_function("say", [])
print(api.call_function("get", []))

Expected result: "你好" and "世界" are printed to the console. Actual result: a parsing error for the UTF-8 characters is reported.

Unicode parsing error: Invalid unicode codepoint (4f60), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (597d), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (4e16), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (754c), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (4f60), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (597d), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (4e16), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (754c), cannot represent as ASCII/Latin-1
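
To see why these code points trigger the error, it can help to look at how they are encoded. The following is a minimal sketch, not code from LuaAPI or Godot; `fits_latin1` and `encode_utf8` are hypothetical helper names introduced only for illustration. It shows that a code point such as 4f60 is above the Latin-1 range (0x00 to 0xFF) and needs three bytes in UTF-8.

```cpp
#include <string>

// Hypothetical helper: Latin-1 can only represent code points 0x00..0xFF.
bool fits_latin1(char32_t cp)
{
    return cp <= 0xFF;
}

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Returns an empty string for surrogates and values above 0x10FFFF.
std::string encode_utf8(char32_t cp)
{
    std::string out;
    if (cp <= 0x7F)
    {
        out += static_cast<char>(cp);
    }
    else if (cp <= 0x7FF)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0xFFFF)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out; // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, `encode_utf8(0x4F60)` yields the three bytes E4 BD A0, which is how 你 is stored in a UTF-8 source string, while `fits_latin1(0x4F60)` is false.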

Minimal reproduction:

var api := LuaAPI.new()
api.do_string("""
function say()
print("字")
end
""")
api.call_function("say", [])

Expected result: Output '字' on the console.

Actual result: the engine reports an error:

Unicode parsing error: Invalid unicode codepoint (5b57), cannot represent as ASCII/Latin-1
Unicode parsing error: Invalid unicode codepoint (5b57), cannot represent as ASCII/Latin-1

If do_file is called instead, no error is reported, but the printed output is garbled.

var api := LuaAPI.new()
api.do_file("res://test.lua")
api.call_function("say", [])
-- test.lua
function say()
    print("字")
end

Expected output: 字
Actual output: garbled text
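
One plausible cause of garbled output like this (an assumption; the thread does not confirm the exact code path) is that the UTF-8 bytes are interpreted one byte per character, as if they were Latin-1. The sketch below reproduces that effect in isolation. `latin1_to_utf8` is a hypothetical helper, not part of LuaAPI or Godot.

```cpp
#include <string>

// Hypothetical helper: treat every input byte as a Latin-1 code point
// and re-encode it as UTF-8. Feeding it text that is already UTF-8
// produces the typical "mojibake" garbling.
std::string latin1_to_utf8(const std::string& bytes)
{
    std::string out;
    for (unsigned char b : bytes)
    {
        if (b < 0x80)
        {
            out += static_cast<char>(b); // ASCII is unchanged
        }
        else
        {
            // Code points 0x80..0xFF take two bytes in UTF-8.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Under this assumption, the three bytes of 字 (E5 AD 97) become three separate characters taking six bytes in total, so the console no longer shows the original character.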

Environment:
OS: Windows 11
Godot version: v4.1.1.stable

upizpp commented 9 months ago

It seems that this issue has already been mentioned in #152. However, I believe the problem is not caused by Lua itself, but by the conversion between Lua strings and GDScript strings. Even without the utf8 library, native Lua strings can handle UTF-8 text well, because they are plain byte sequences. Only a few additional issues need to be considered.

I hope the following code can provide you with some inspiration for fixing the problem.

--- Returns the byte length of the UTF-8 character whose lead byte is b.
--- Needed because UTF-8 characters vary in length (1 to 4 bytes).
function string.byte_length(b)
    local d
    if b > 239 then
        d = 4
    elseif b > 223 then
        d = 3
    elseif b > 191 then
        -- 128..191 are continuation bytes, not lead bytes
        d = 2
    else
        d = 1
    end
    return d
end

-- Splits a UTF-8 string into its characters and returns them as a table.
-- Splitting byte by byte would break multi-byte characters.
-- Example:
-- argument: "你好,world!"
-- result: {"你", "好", ",", "w","o","r","l","d", "!"}
function string:to_table()
    local result = {}
    local i = 1
    -- Checking the bound first also makes an empty string return an empty table.
    while i <= #self do
        local b = self:byte(i)
        local d = string.byte_length(b)
        table.insert(result, self:sub(i, i + d - 1))
        i = i + d
    end
    return result
end
-- Usage
local s = "你好,world!"
local t = s:to_table()
for i = 1, #t do
    print(t[i])
end

Output:

你
好
,
w
o
r
l
d
!

upizpp commented 9 months ago

Attached is a C++ implementation of the Lua code above.

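#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Returns the byte length of the UTF-8 character whose lead byte is b.
// The parameter must be unsigned char: where plain char is signed, bytes
// >= 0x80 would be negative and this function would always return 1.
int byte_length(unsigned char b)
{
    if (b > 239) return 4;
    if (b > 223) return 3;
    if (b > 191) return 2; // 128..191 are continuation bytes, not lead bytes
    return 1;
}

// Splits a UTF-8 string into one std::string per character.
std::vector<std::string> split(const std::string& str)
{
    std::vector<std::string> result;
    for (std::size_t i = 0; i < str.length();)
    {
        // Be sure to use unsigned char (see byte_length).
        unsigned char byte = static_cast<unsigned char>(str[i]);
        int len = byte_length(byte);
        // substr clamps the length if the last character is truncated.
        result.push_back(str.substr(i, len));
        i += len;
    }
    return result;
}

int main()
{
    std::string str;
    std::cin >> str; // reads a single whitespace-delimited token
    std::vector<std::string> parts = split(str);
    std::cout << parts.size() << std::endl;
    for (const auto& s : parts)
    {
        std::cout << s << '\n';
    }
    return 0;
}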

I think this problem can be solved without modifying Lua's C implementation of strings, by adding appropriate conversion functions between Lua strings and Godot strings.
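
As a rough illustration of what such a conversion function could look like, here is a minimal sketch that decodes UTF-8 bytes into code points. It assumes the target string type stores one code point per element; `std::u32string` stands in for it here, and `decode_utf8` is a hypothetical name, not a LuaAPI or Godot function. The sketch is lenient: malformed bytes become U+FFFD, and overlong or surrogate encodings are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into code points.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80)                    { len = 1; cp = b; }
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { len = 3; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; }
        else
        {
            // Stray continuation byte or invalid lead byte.
            out += U'\uFFFD';
            ++i;
            continue;
        }

        if (i + len > in.size())
        {
            // Sequence is cut off at the end of the input.
            out += U'\uFFFD';
            break;
        }

        bool ok = true;
        for (std::size_t k = 1; k < len; ++k)
        {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok)
        {
            // Skip only the lead byte and resynchronize on the next one.
            out += U'\uFFFD';
            ++i;
            continue;
        }

        out += cp;
        i += len;
    }
    return out;
}
```

With this sketch, the six bytes of "你好" decode to the two code points U+4F60 and U+597D, the same values that appear in the error messages at the top of this issue.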

upizpp commented 9 months ago

I just checked the list of released versions and found the cause of the problem: I was using a version older than v2.1 beta 6. After switching to the latest version that supports 4.1.x, the UTF-8 issue was resolved.

Trey2k commented 9 months ago

> I think this problem can be solved without modifying Lua's C implementation of strings, by adding appropriate conversion functions between Lua strings and Godot strings.

Yeah, the original issue had some misnomers due to a misunderstanding on my part. Luckily I did get it working. It was mostly a matter of changing a bunch of .ascii() calls to .utf8() calls, as well as making sure to parse char* as UTF-8.