Closed dcsan closed 7 years ago
Seems to affect any non-ASCII characters.
Try to match "é 你好 é" against [*] 你好 [*] ((?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+)你好(?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+))
Reply: ERR: No Reply Matched
Python isn't affected, using the same regexp:
[RS] Try to match 'é 你好 é' against '[*] 你好 [*]' ('^(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))$')
[RS] Found a match!
[RS] Reply: wrapping 你好
I'll have to do some experimenting later to see whether this is a JavaScript/ES5 Unicode regexp bug and if ES6 Unicode-aware regular expressions will match this where the old regexps won't.
Edit:
Without the u flag, . matches any BMP symbol except line terminators. When the ES6 u flag is set, . matches astral symbols too.
Looks like a likely source of the problem. The Chinese symbols you used aren't in the BMP set.
Basic Multilingual Plane (BMP). This plane contains most of the characters needed for scripts and languages in routine use in the world today. The plane is nearly filled with only 128 of the 65,534 code points remaining to be allocated.
The characters used are "nihao" = hello, the most basic Chinese phrase, so I don't think that's the problem. You could try the match with some simpler western Unicode stuff like umlauts...
Is there a reason not to use the u flag? For old browsers? FWIW I'm running in node5/ES6 and I have set utf8 as an option to the rive interpreter.
You're right, 你 is in the BMP, in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
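For anyone who wants to check this themselves, here is a minimal sketch. The helper name isInBMP is made up for illustration and is not part of RiveScript. It tests whether a string stays inside the BMP and shows the quoted behaviour of . with and without the u flag on an astral symbol (U+1F600 is used as the example).

```javascript
// Hypothetical helper (not a RiveScript API): true when every code point
// in the string lies in the Basic Multilingual Plane (U+0000 to U+FFFF).
function isInBMP(str) {
  for (const ch of str) {            // for...of iterates by code point
    if (ch.codePointAt(0) > 0xFFFF) return false;
  }
  return true;
}

const astral = "\u{1F600}";          // an emoji outside the BMP

const bmpCheck = {
  nihao: isInBMP("你好"),
  astral: isInBMP(astral),
};

// "." matches one UTF-16 code unit without the u flag,
// and one full code point with it.
const dotCheck = {
  bmpNoFlag: /^.$/.test("你"),
  astralNoFlag: /^.$/.test(astral),
  astralWithU: /^.$/u.test(astral),
};

console.log(bmpCheck, dotCheck);
```

Since 你好 is entirely in the BMP, the . behaviour should not be what breaks this trigger.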
The u flag raises a syntax error on ES5 engines. It doesn't seem to help on Node 6 anyway, though:
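// Regexp generated for the trigger "[*] 你好 [*]"
var pattern = '(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)';
var messages = [
    "a 你好 b",
    "你好 你好 你好",
    "é 你好 é"
];
// Index loop instead of for...in, which iterates keys rather than values
for (var i = 0; i < messages.length; i++) {
    var msg = messages[i];
    console.log("Try: " + msg);
    // The "u" flag needs an ES6 engine
    var match = msg.match(new RegExp("^" + pattern + "$", "u"));
    if (match) {
        console.log("Matched!");
    }
}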
So at least we can rule out that being the cause of the problem. It might be the \b word boundary sequence in the regexp.
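A quick way to test the \b theory, as a sketch rather than a diagnosis of RiveScript itself: in JavaScript \b is defined in terms of \w, which only covers [A-Za-z0-9_], so there is no word boundary next to a character like 你 or é unless an ASCII word character sits on the other side. The pattern below is copied from the snippet above; the variable names are made up for illustration.

```javascript
// \b only sees ASCII word characters, with or without the u flag.
const boundaryChecks = {
  ascii: /\bhello\b/.test("hello"),
  chinese: /\b你好\b/.test("你好"),
  chineseWithU: /\b你好\b/u.test("你好"),
};

// Same generated pattern as in the snippet above.
const pattern = '(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)';
const re = new RegExp("^" + pattern + "$");

// Record which messages match the anchored pattern.
const results = {};
for (const msg of ["a 你好 b", "你好 你好 你好", "é 你好 é"]) {
  results[msg] = re.test(msg);
}

console.log(boundaryChecks, results);
```

Only "a 你好 b" gets through, because the ASCII letters at both ends give \b something to anchor on. The other two messages start with a non-ASCII character, so neither \s nor \b can match at position 0.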
Found this: https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters
Might be able to replace things like (\s|\b) with (\s|^) (maybe also (\s|$)?) and make sure it doesn't break #48 again.
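One possible reading of that idea, as an untested-against-RiveScript sketch: swap \b for ^ in the group before the trigger word and for $ in the group after it. The names lead, trail and fixedRe are hypothetical, and this is not the change that was actually made to triggerRegexp(); whether it keeps #48 working would still need checking.

```javascript
// Assumption: \b becomes ^ on the leading side and $ on the trailing side.
const lead  = '(?:(?:\\s|^)+(?:.+?)(?:\\s|^)+|(?:^|\\s)+)';
const trail = '(?:(?:\\s|$)+(?:.+?)(?:\\s|$)+|(?:$|\\s)+)';
const fixedRe = new RegExp("^" + lead + "你好" + trail + "$");

const messages = [
  "a 你好 b",
  "你好 你好 你好",
  "é 你好 é",
  "你好",
  "你好吗",   // no whitespace between 你好 and the next character
];

// Record which messages match the rewritten pattern.
const fixedResults = {};
for (const msg of messages) {
  fixedResults[msg] = fixedRe.test(msg);
}

console.log(fixedResults);
```

With this variant the space-separated messages and the bare 你好 match regardless of script, while 你好吗 still does not, because the optional wildcard is only allowed to start after whitespace.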
Just to clarify, this will match: 我怎么说*
This won't: 我怎么说[*]
Did some more testing with Chinese characters:
+ [*]你好[*]
- works
Chinese wildcard characters + Chinese trigger does NOT work. English wildcard letters + Chinese trigger WORKS.
+ [*]你好[*]
- works
Currently an alternative to “[*]你好[*]” is
+ (*你好|你好|你好*|*你好*)
- works
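A rough illustration of why the alternation gets around the problem. This assumes each * in a trigger turns into (.+?) and that no \b or \s is added around it; the regexp RiveScript really generates may differ, and altRe is a made-up name.

```javascript
// Hand-written approximation of the trigger (*你好|你好|你好*|*你好*).
// Assumption: "*" becomes "(.+?)" with no word-boundary handling.
const altRe = new RegExp("^(?:(.+?)你好|你好|你好(.+?)|(.+?)你好(.+?))$");

const samples = ["你好", "我说你好", "你好吗", "他说你好了", "再见"];

// Record which samples match the approximated pattern.
const altResults = {};
for (const msg of samples) {
  altResults[msg] = altRe.test(msg);
}

console.log(altResults);
```

Because nothing in this pattern depends on \b, the wildcard can sit directly against Chinese characters with no spaces in between.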
I was wondering if Rive could enable normal regexes? SuperScript has that capability I think.
@dcsan I think I may just have to add that feature. What I've learned from porting RiveScript to 5 different languages is that A) Unicode is hard, and B) regular expression engines aren't all created equal. Things that work in regexps in one language don't work in another, and it's hard to make RiveScript support all kinds of Unicode across all versions. Allowing the end user to write a literal regular expression lets them fix their specific issues their own way, and avoids all the 'magic' that triggerRegexp() does that might interfere with their attempt to get a working regexp out of it.
RiveScript's predecessor supported a regexp command: everything old is new again.
It would be a neat feature to add, and it would open up full regexp power, especially for multiple languages.
I didn't know about that old Perl version.
BTW, regarding the tilde, I very much liked SuperScript's old implementation where you could do things like ~emohello and it would expand to match a whole category of phrases (a bit like RiveScript arrays, but I believe using a much bigger NLP corpus). I think they removed that recently and made users call a function, but that is a nice syntax to reserve the tilde for: ~= approx equal.
Closing this issue in favor of tracking the ~Regexp feature in https://github.com/aichaos/rivescript-wd/issues/6
I'm trying some Chinese inputs, and the [*] format doesn't seem to behave.
So as you can see, using normal western code ww on the sides of the Chinese characters is OK, but the [*] isn't matching if using Chinese characters, with or without a space. I also tried the rive pattern without spaces, although I'm not sure what the best practice is here.
FWIW normal * is matching OK: