Closed hoofcushion closed 5 months ago
#define UNICODE_VALID(Char) \
((Char) < 0x110000 && (((Char) & 0xFFFFF800) != 0xD800) && \
((Char) < 0xFDD0 || (Char) > 0xFDEF) && ((Char) & 0xFFFE) != 0xFFFE)
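I found this macro definition and converted it to the Lua form below. Running this check before yield avoids triggering the crash. Still, crashing just because a UTF-8 string is invalid seems unreasonable. Could this be adjusted?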
local function unicode_valid(char)
  return char<0x110000 and
    ((char&0xfffff800)~=0xd800) and
    (char<0xfdd0 or char>0xfdef) and
    (char&0xfffe)~=0xfffe
end
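To make the macro's conditions easier to discuss, here is a minimal C sketch that splits them into the three groups they cover: values beyond U+10FFFF, surrogates, and noncharacters. The names `cp_class` and `classify_code_point` are hypothetical and introduced only for illustration; they are not fcitx5 identifiers.

```c
#include <stdint.h>

/* The macro quoted above, unchanged apart from indentation. */
#define UNICODE_VALID(Char) \
    ((Char) < 0x110000 && (((Char) & 0xFFFFF800) != 0xD800) && \
     ((Char) < 0xFDD0 || (Char) > 0xFDEF) && ((Char) & 0xFFFE) != 0xFFFE)

/* Hypothetical classification of why a code point fails the macro. */
typedef enum {
    CP_OK,            /* passes UNICODE_VALID */
    CP_OUT_OF_RANGE,  /* above U+10FFFF */
    CP_SURROGATE,     /* U+D800..U+DFFF */
    CP_NONCHARACTER   /* U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF */
} cp_class;

static cp_class classify_code_point(uint32_t cp)
{
    if (cp >= 0x110000)
        return CP_OUT_OF_RANGE;
    if ((cp & 0xFFFFF800) == 0xD800)
        return CP_SURROGATE;
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)
        return CP_NONCHARACTER;
    return CP_OK;
}
```

Under this split, `classify_code_point(0xFFFF)` falls into the noncharacter group, which is the case the rest of the thread argues about.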
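First, fcitx uses dbus for communication, and dbus requires every string to be a valid UTF-8 string. If I send the string without crashing, some other library will crash on my behalf.
Second, any invalid string coming from an engine is considered a bug in that engine, so rather than validating it and replacing it with an empty string, I would rather crash outright.
Please do not send invalid strings.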
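An engine that wants to follow this advice could validate its output before handing it over. The following is a minimal sketch in C, assuming the strict rules of the `UNICODE_VALID` macro quoted earlier; `code_point_ok` and `utf8_string_ok` are hypothetical helper names, and this is not the validation code used by fcitx5 or dbus.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same conditions as the UNICODE_VALID macro, written as a function. */
static bool code_point_ok(uint32_t cp)
{
    return cp < 0x110000 && (cp & 0xFFFFF800) != 0xD800 &&
           (cp < 0xFDD0 || cp > 0xFDEF) && (cp & 0xFFFE) != 0xFFFE;
}

/* Returns true if s[0..len) is well-formed UTF-8 and every decoded
 * code point passes code_point_ok. */
static bool utf8_string_ok(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        uint32_t cp, min;
        size_t extra;

        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return false;           /* stray continuation or invalid lead byte */

        if (len - i - 1 < extra)
            return false;            /* sequence truncated at end of input */
        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80)
                return false;        /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min)
            return false;            /* overlong encoding */
        if (!code_point_ok(cp))
            return false;            /* surrogate, noncharacter, or too large */
        i += extra + 1;
    }
    return true;
}
```

With these rules the three bytes `EF BF BF` (U+FFFF) are rejected, so an engine could skip or replace such a candidate instead of yielding it.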
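Sorry, I still have a question. Noncharacters should not be treated as invalid characters. At least for noncharacters, fcitx could try to keep them or replace them with an empty string; that should be harmless in any text stream.
Corrigendum #9: Clarification About Noncharacters
Are noncharacters invalid in Unicode strings and UTFs?
Can failing to replace noncharacters with U+FFFD lead to problems?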
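On the dbus side, the question of whether noncharacters are valid has also been clarified. I don't know exactly how other libraries behave, but for noncharacters, treating them as invalid strings, or having dbus crash because of them, is probably inappropriate.
Specification: explicitly allow the Unicode noncharacters
Bug 63072 - allow Unicode non-characters as per Corrigendum 9
If my application makes specific, internal use of a noncharacter, what should I do with input text?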
Quoting from Corrigendum #9:
Noncharacters in the Unicode Standard are intended for internal use
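An input method, however, is an architecture that spans applications, compositors, input method frameworks, and input method engines. I don't think that fits the definition of internal use, so noncharacters should not be passed through the input method architecture.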
Corrigendum #9: Clarification About Noncharacters
The real intent of noncharacters is that they are permanently prohibited from being assigned standard, interchangeable meanings, rather than that they are prohibited from occurring in Unicode strings which happen to be interchanged.
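Change D14 in Section 3.4, Characters and Encoding, as indicated: Noncharacter: A code point that is permanently reserved for internal use ~and that should never be interchanged~. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10₁₆) and the values U+FDD0..U+FDEF.
Unicode's explanation here is clear: noncharacters appearing in interchanged text are not prohibited. Noncharacters are mainly meant for "internal use" and are "permanently reserved", but that does not mean they cannot be interchanged. That is exactly why the erroneous "should never be interchanged" wording had to be clarified; otherwise Corrigendum #9 would be pointless.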
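I am not saying "cannot", I am saying "meaningless". Using noncharacters requires both sides of the exchange to agree on what they mean; otherwise the two sides may not handle them correctly (for example, what to do when displaying or outputting them). The parties an input method exchanges text with include a large number of third-party applications that the input method does not control. Unless the input method protocol specifies how noncharacters are handled, an application that receives a noncharacter from the input method may not handle it correctly, which leads to all kinds of unexpected results. Moreover, strings carried by an input method protocol come in only two kinds: those for display and those for input. Noncharacters are meaningless for both, because they cannot be displayed and are not input that the receiving program expects.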
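The user may simply want to type this character, and Unicode does not prohibit it. Noncharacters are also harmless in a text stream unless other programs deliberately crash on them. That behavior is no different from deliberately crashing on any other valid character, and it is a bug in those programs, not in the input method.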
@hoofcushion Since they changed it, we can change ours to match.
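The thread does not show the resulting change. As a rough sketch only, assuming that matching means accepting noncharacters in the way the dbus references listed earlier propose, a relaxed check would keep the range and surrogate tests and drop the two noncharacter tests. `unicode_scalar_value` is a hypothetical name, not an fcitx5 function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical relaxed check: accept any Unicode scalar value, i.e.
 * anything up to U+10FFFF that is not a surrogate. Noncharacters such
 * as U+FFFF and U+FDD0 pass. */
static bool unicode_scalar_value(uint32_t cp)
{
    return cp < 0x110000 && (cp & 0xFFFFF800) != 0xD800;
}
```

Surrogates and values above U+10FFFF are still rejected because they cannot appear in well-formed UTF-8 at all, which is a separate matter from the noncharacter debate.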
Summary
Rendering an "invalid utf8 string" crashes fcitx5. For example: 0xffff (U+FFFF, a Unicode noncharacter).
Steps to Reproduce
yield(Candidate("",0,0,utf8.char(65535),""))
inside a lua_translator
Expected Behavior
Don't crash when an "Invalid utf8 string error" occurs. Maybe skip them all when rendering.
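For context on the reproduction step: the string produced by `utf8.char(65535)` is structurally ordinary three-byte UTF-8, and it is only the decoded code point that the check objects to. A small C sketch of the bit packing, with `utf8_encode` as a hypothetical helper that performs no surrogate or noncharacter checks:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 if cp is above U+10FFFF.
 * Plain bit packing only: surrogates and noncharacters are not rejected. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0xFFFF this yields the bytes `EF BF BF`, so skipping such a candidate at render time would require decoding and inspecting code points rather than only checking byte structure.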
Output of fcitx5-diagnose command