aichaos / rivescript-js

A RiveScript interpreter for JavaScript. RiveScript is a scripting language for chatterbots.
https://www.rivescript.com/
MIT License

regex doesn't work for UTF8 #147

Closed dcsan closed 7 years ago

dcsan commented 8 years ago

I'm trying some Chinese inputs, and the [*] format doesn't seem to behave:

    + [*] 你好 [*]
    - wrapping 你好

image

As you can see, putting ordinary Western text such as ww on either side of the Chinese characters is OK, but [*] doesn't match when the surrounding text is Chinese, with or without a space. I also tried the Rive pattern without spaces, i.e.

    + [*]你好[*]

although I'm not sure what the best practice is here.

FWIW normal * is matching OK:

    + 你好
    - 你-》你好<get nickname>

    + 你好 *
    - 你-》<star>

    + [*] 你好 [*]
    - wrapping 你好

image

kirsle commented 8 years ago

Seems to affect any non-ASCII characters.

    Try to match "é 你好 é" against [*] 你好 [*] ((?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+)你好(?:(?:\s|\b)+(?:.+?)(?:\s|\b)+|(?:\b|\s)+))
    Reply: ERR: No Reply Matched

kirsle commented 8 years ago

Python isn't affected, using the same regexp:

    [RS] Try to match 'é 你好 é' against '[*] 你好 [*]' ('^(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\s|\\b))$')
    [RS] Found a match!
    [RS] Reply: wrapping 你好

I'll have to do some experimenting later to see whether this is a JavaScript/ES5 Unicode regexp bug and if ES6 Unicode-aware regular expressions will match this where the old regexps won't.

Edit:

Without the u flag, . matches any BMP symbol except line terminators. When the ES6 u flag is set, . matches astral symbols too.

Looks like a likely source of the problem. The Chinese symbols you used aren't in the BMP set.

dcsan commented 8 years ago

Basic Multilingual Plane (BMP). This plane contains most of the characters needed for scripts and languages in routine use in the world today. The plane is nearly filled with only 128 of the 65,534 code points remaining to be allocated.

The characters used are "nihao" (hello), the most basic Chinese phrase, so I don't think that's the problem. You could try the match with some simpler Western Unicode characters, like umlauts...

Is there a reason not to use the u flag? For old browsers? FWIW I'm running on Node 5/ES6, and I have set utf8 as an option on the Rive interpreter.

kirsle commented 8 years ago

You're right, 你 is in the BMP, in the CJK Unified Ideographs block (U+4E00 to U+9FFF).
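
A quick way to check this, sketched under the assumption of an ES6-capable engine (for `codePointAt`, `Array.from` and the `u` flag). The helper name `isAllBmp` is made up for illustration and is not part of RiveScript:

```javascript
// Hypothetical helper: true if every code point in the text is in the
// BMP (U+0000 to U+FFFF). Array.from() iterates by code point, not code unit.
function isAllBmp(text) {
    return Array.from(text).every(function (ch) {
        return ch.codePointAt(0) <= 0xFFFF;
    });
}

// 你 is U+4F60, inside the CJK Unified Ideographs block.
var uPlus = "U+" + "你".codePointAt(0).toString(16).toUpperCase();
console.log(uPlus, isAllBmp("你好"));

// An astral symbol for contrast: without the u flag, "." only sees one
// half of the surrogate pair, so /^.$/ cannot match the whole symbol.
var astral = "\u{1F600}";
console.log(isAllBmp(astral), /^.$/.test(astral), /^.$/u.test(astral));
```

Since 你好 is entirely in the BMP, the `.` behaviour described in the quote does not come into play for this trigger.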

The u flag raises a syntax error on ES5 engines. It doesn't seem to help on Node 6 though, anyway:

    // Pattern from the debug output above for the trigger "[*] 你好 [*]".
    var pattern = '(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)你好(?:(?:\\s|\\b)+(?:.+?)(?:\\s|\\b)+|(?:\\b|\\s)+)';
    var messages = [
        "a 你好 b",
        "你好 你好 你好",
        "é 你好 é"
    ];

    // Try each message with the ES6 "u" (Unicode) flag set.
    messages.forEach(function (msg) {
        console.log("Try: " + msg);
        var match = msg.match(new RegExp("^" + pattern + "$", "u"));
        if (match) {
            console.log("Matched!");
        }
    });

So at least we can rule out that being the cause of the problem. It might be the \b word boundary sequence in the regexp.
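
A minimal sketch that isolates the `\b` suspicion. It relies only on standard JavaScript behaviour: `\b` is defined in terms of `\w`, and `\w` covers just `[A-Za-z0-9_]`, with or without the `u` flag. The helper name `startsAtWordBoundary` is made up for this example:

```javascript
// Hypothetical helper: does a word boundary exist at the very start of the text?
function startsAtWordBoundary(text, flags) {
    return new RegExp("^\\b", flags).test(text);
}

["a 你好 b", "é 你好 é", "你好 你好 你好"].forEach(function (msg) {
    console.log(msg, startsAtWordBoundary(msg, ""), startsAtWordBoundary(msg, "u"));
});

// Both branches of the generated pattern start with \s or \b, so a message
// that begins with a non-ASCII character cannot even start to match.
var leadingGroup = /^(?:\s|\b)/;
console.log(leadingGroup.test("a 你好 b"), leadingGroup.test("é 你好 é"));
```

Only the message starting with the ASCII letter `a` reports a boundary, which is consistent with that message being the only one of the three that matches the full pattern.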

kirsle commented 8 years ago

Found this: https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters

Might be able to replace things like (\s|\b) with (\s|^) (maybe also (\s|$)?), making sure it doesn't break #48 again.
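
A rough sketch of what that substitution could look like, not the actual fix. The function `buildWrappedPattern` is hypothetical, it assumes the core text contains no regexp metacharacters, and it has not been checked against the behaviour that #48 protects:

```javascript
// Hypothetical: build the regexp for "[*] core [*]" using ^ and $ in place
// of \b, so the match no longer depends on ASCII-only word boundaries.
function buildWrappedPattern(core) {
    var lead = "(?:(?:\\s|^)+(?:.+?)(?:\\s|^)+|(?:^|\\s)+)";
    var trail = "(?:(?:\\s|$)+(?:.+?)(?:\\s|$)+|(?:$|\\s)+)";
    return new RegExp("^" + lead + core + trail + "$");
}

var wrapped = buildWrappedPattern("你好");
["a 你好 b", "你好 你好 你好", "é 你好 é", "你好", "你好吗"].forEach(function (msg) {
    console.log(msg, wrapped.test(msg));
});
```

With this variant the optional wildcards only match text that is separated from 你好 by whitespace, so an unspaced message such as 你好吗 still does not match. That may or may not be acceptable for languages written without spaces.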

dcsan commented 7 years ago

Just to clarify, this trigger will match: 我怎么说*

This one won't: 我怎么说[*]

Lewikster commented 7 years ago

Did some more testing with Chinese characters:

    + [*]你好[*]
    - works

image

Chinese wildcard characters + Chinese trigger does NOT work. English wildcard letters + Chinese trigger WORKS.

    + [*]你好[*]
    - works

image

Currently an alternative to "[*]你好[*]" is:

    + (*你好|你好|你好*|*你好*)
    - works
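
One plausible reason this alternation works is that it contains no optional `[*]`, so no `\b` ends up in the compiled pattern. The sketch below is a hand-written approximation, assuming each `*` becomes `(.+?)` as the debug output earlier in the thread suggests; it is not the exact regexp RiveScript generates:

```javascript
// Approximation (assumption): "*" compiles to "(.+?)" and the alternation
// is anchored at both ends. No \b appears anywhere in the pattern.
var workaround = new RegExp("^(?:(.+?)你好|你好|你好(.+?)|(.+?)你好(.+?))$");

["你好", "é 你好 é", "你好吗", "我说你好", "再见"].forEach(function (msg) {
    console.log(msg, workaround.test(msg));
});
```

Unlike `[*]`, the plain wildcard does not require whitespace around 你好, so unspaced messages such as 你好吗 match as well.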

image

dcsan commented 7 years ago

I was wondering if Rive could enable normal regexes? SuperScript has that capability I think.

kirsle commented 7 years ago

@dcsan I think I may just have to add that feature. What I've learned from porting RiveScript to 5 different languages is that A) Unicode is hard, and B) regular expression engines aren't all created equally. Things that work in regexps in one language don't work in another, and it's hard to make RiveScript support all kinds of Unicode across all versions; so allowing the end user to write a literal regular expression can enable them to fix their specific issues their own way, and avoids all the 'magic' that triggerRegexp() does that might interfere with their attempt to get a working regexp out of it.

RiveScript's predecessor supported a regexp command: everything old is new again.

dcsan commented 7 years ago

It would be a neat feature to add, and it would open up full regexp power, especially for multiple languages.

I didn't know about that old Perl version.

BTW, regarding the tilde: I very much liked SuperScript's old implementation, where you could write things like ~emohello and it would expand to match a whole category of phrases (a bit like RiveScript arrays, but I believe using a much bigger NLP corpus). I think they removed that recently and made users call a function, but it is a nice syntax to reserve the tilde for, as in ~= "approximately equal".

kirsle commented 7 years ago

Closing this issue in favor of tracking the ~Regexp feature in https://github.com/aichaos/rivescript-wd/issues/6