Closed upizpp closed 9 months ago
It seems that this issue has already been mentioned in #152. However, I believe the problem is not caused by Lua itself, but by the conversion between Lua strings and GDScript strings. Even without the utf8 library, native Lua strings can hold UTF-8 text well, because they are plain byte sequences. Only a few additional issues need to be considered, such as the variable byte length of each character.
I hope the following code can provide you with some inspiration for fixing the problem.
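-- UTF-8 characters vary in length, so this function is required. It returns
-- the number of bytes in the sequence introduced by lead byte b (0..255).
function string.byte_length(b)
    if b >= 240 then
        return 4
    elseif b >= 224 then
        return 3
    elseif b >= 192 then
        return 2
    else
        -- ASCII; a stray continuation byte (128..191) also counts as 1
        return 1
    end
end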
-- Splits a UTF-8 string into a table with one entry per character.
-- Plain byte indexing cannot do this, again because characters vary in length.
-- Example:
-- argument: "你好,world!"
-- result: {"你", "好", ",", "w", "o", "r", "l", "d", "!"}
function string:to_table()
    local result = {}
    local i = 1
    -- Checking the bound first also makes the empty string return {}.
    while i <= #self do
        local d = string.byte_length(self:byte(i))
        table.insert(result, self:sub(i, i + d - 1))
        i = i + d
    end
    return result
end
-- usage
local s = "你好,world!"
local t = s:to_table()
for i = 1, #t do
    print(t[i])
end
Output:
你
好
,
w
o
r
l
d
!
Attached is a C++ version of the Lua code above.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// The parameter must be unsigned char: plain char may be signed, which would
// make every byte >= 0x80 negative and break the comparisons.
std::size_t byte_length(unsigned char b)
{
    if (b >= 240) return 4;
    if (b >= 224) return 3;
    if (b >= 192) return 2;
    return 1; // ASCII, or a stray continuation byte
}

std::vector<std::string> split(const std::string& str)
{
    std::vector<std::string> result;
    for (std::size_t i = 0; i < str.length();)
    {
        // Be sure to use unsigned char.
        unsigned char byte = static_cast<unsigned char>(str[i]);
        std::size_t len = byte_length(byte);
        // substr clamps len, so a truncated final sequence cannot read past the end.
        result.push_back(str.substr(i, len));
        i += len;
    }
    return result;
}

int main()
{
    std::string str;
    std::cin >> str; // reads one whitespace-delimited token
    std::vector<std::string> pieces = split(str);
    std::cout << pieces.size() << std::endl;
    for (const auto& s : pieces)
    {
        std::cout << s << '\n';
    }
    return 0;
}
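The splitter above trusts its input. For text coming from an arbitrary Lua script, a stricter variant might be worth considering. The following is only a sketch: `utf8_sequence_length`, `split_utf8_checked` and `check_split_utf8` are names made up for illustration, and the policy of emitting malformed bytes one at a time is an assumption, not something the addon does. It checks bit patterns only and does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Length of the UTF-8 sequence introduced by lead byte b,
// or 0 if b cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1; // 0xxxxxxx: ASCII
    if (b < 0xC0) return 0; // 10xxxxxx: continuation byte
    if (b < 0xE0) return 2; // 110xxxxx
    if (b < 0xF0) return 3; // 1110xxxx
    if (b < 0xF8) return 4; // 11110xxx
    return 0;               // 0xF8..0xFF never occur in UTF-8
}

// Hypothetical helper: like split(), but a malformed or truncated sequence
// is emitted one byte at a time instead of swallowing its neighbours.
std::vector<std::string> split_utf8_checked(const std::string& s)
{
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size())
    {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(s[i]));
        bool ok = len != 0 && i + len <= s.size();
        for (std::size_t k = 1; ok && k < len; ++k)
        {
            // Every byte after the lead byte must look like 10xxxxxx.
            ok = (static_cast<unsigned char>(s[i + k]) & 0xC0) == 0x80;
        }
        if (!ok) len = 1;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}

// Small self-check; bytes are spelled out so the source file encoding does not matter.
void check_split_utf8()
{
    const std::string hello = "\xE4\xBD\xA0\xE5\xA5\xBD,world!"; // "你好,world!"
    const std::vector<std::string> parts = split_utf8_checked(hello);
    assert(parts.size() == 9);
    assert(parts[0] == "\xE4\xBD\xA0"); // "你"
    assert(parts[2] == ",");
    // A sequence cut off after two of its three bytes yields two single bytes.
    assert(split_utf8_checked("\xE4\xBD").size() == 2);
}
```

Calling `check_split_utf8()` from `main` exercises both the well-formed and the truncated case.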
I think this can be solved without modifying Lua's string implementation in C, by adding appropriate conversion functions between Lua strings and Godot strings.
I just checked the list of releases and found that the cause of the problem was that I was using a version older than v2.1 beta 6. After switching to the latest version that supports 4.1.x, the UTF-8 issue was resolved.
Yeah, the original issue had some misnomers due to a misunderstanding on my part. Luckily I did get it working. It was mostly a matter of changing a bunch of .ascii() calls to .utf8() calls, as well as making sure to parse char* as UTF-8.
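To illustrate what "parse char* as UTF-8" means at the byte level, here is a standalone sketch. It deliberately uses no Godot or Lua API: `decode_utf8` and `check_decode_utf8` are hypothetical names, and replacing malformed input with U+FFFD is an assumption about error handling, not a description of the actual fix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper, not a Godot or Lua API: decodes a NUL-terminated
// UTF-8 buffer into code points. Malformed input becomes U+FFFD.
// Sketch only: overlong forms and surrogates are not rejected.
std::u32string decode_utf8(const char* bytes)
{
    std::u32string out;
    const std::size_t n = std::strlen(bytes);
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0; // stays 0 for bytes that cannot start a sequence
        char32_t cp = 0;
        if (b < 0x80)      { len = 1; cp = b; }
        else if (b < 0xC0) { len = 0; }              // stray continuation byte
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; }
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; }
        else if (b < 0xF8) { len = 4; cp = b & 0x07; }

        bool ok = len != 0 && i + len <= n;
        for (std::size_t k = 1; ok && k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            ok = (c & 0xC0) == 0x80;
            cp = (cp << 6) | (c & 0x3F); // append the 6 payload bits
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Small self-check with the characters from this issue.
void check_decode_utf8()
{
    const std::u32string zi = decode_utf8("\xE5\xAD\x97"); // "字"
    assert(zi.size() == 1 && zi[0] == 0x5B57u);
    const std::u32string hello = decode_utf8("\xE4\xBD\xA0\xE5\xA5\xBD"); // "你好"
    assert(hello.size() == 2 && hello[0] == 0x4F60u && hello[1] == 0x597Du);
}
```

The point of the sketch is that three bytes collapse into one code point, whereas an ASCII-style conversion would produce one character per byte.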
Problem description: If any non-ASCII character, i.e. a multibyte UTF-8 encoded character, appears in the Lua code (whether in a string or as an identifier), the result is either unexpected or an error is reported.
Code snippet that caused the error:
The expected result is "你好" and "世界" printed to the console, but in reality a UTF-8 parsing error is reported.
Minimal code:
Expected result: '字' is printed to the console.
Actual result: the engine reports an error:
If do_file is called instead, no error is reported, but the printed result is garbled.
Expected output:
字
Actual output:
å
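The garbled "å" is consistent with the UTF-8 bytes being read one byte per character. "字" is E5 AD 97 in UTF-8, and byte E5 on its own is "å" in Latin-1. The sketch below reproduces that effect in isolation. `misread_as_latin1` and `check_misread` are hypothetical names, and the assumption that the old code path widened bytes this way is mine, not something confirmed in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: treats every input byte as its own Latin-1 character
// and re-encodes the result as UTF-8, which is what a terminal would display.
std::string misread_as_latin1(const std::string& utf8_bytes)
{
    std::string out;
    for (unsigned char b : utf8_bytes)
    {
        if (b < 0x80)
        {
            out.push_back(static_cast<char>(b)); // ASCII is unaffected
        }
        else
        {
            // U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}

void check_misread()
{
    // E5 AD 97 becomes U+00E5 "å", U+00AD (soft hyphen, usually invisible)
    // and U+0097 (a control character), so only "å" is visible.
    assert(misread_as_latin1("\xE5\xAD\x97") == "\xC3\xA5\xC2\xAD\xC2\x97");
    assert(misread_as_latin1("abc") == "abc");
}
```

Since the second and third resulting characters are normally not rendered, the console shows just "å", matching the actual output reported above.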
Environment:
OS: Windows 11
Godot version: v4.1.1.stable