aichaos / rivescript-js

A RiveScript interpreter for JavaScript. RiveScript is a scripting language for chatterbots.
https://www.rivescript.com/
MIT License

How to write optional word with empty support? #262

Open mgiejohnson opened 6 years ago

mgiejohnson commented 6 years ago

I am using RiveScript in a Chinese chatbot, but I find it difficult to use "*" when a character is optional. For example, "red chair" in Chinese is "紅色的椅子", but "的" is normally optional: "紅色椅子" has the same meaning. Is there a simple way to make "*" optional?

+ 紅色*椅子
- Red chair

If I write this, it only matches 紅色的椅子 but not 紅色椅子.

+ 紅色(|*)椅子

I tried adding an empty alternative alongside "*", but without success.

+ 紅色椅子|紅色*椅子

This one works, but it is too hard to maintain if I have to write thousands of phrases in this style.

Rivescript sample: https://play.rivescript.com/s/QFFb1zVd8b

dcsan commented 6 years ago

we have some other issues here about problems with UTF8 / Chinese support

https://github.com/aichaos/rivescript-js/issues/147 https://github.com/aichaos/rivescript-js/issues/253

there's a ? operator you can try: https://github.com/aichaos/rivescript-js/pull/256

mgiejohnson commented 6 years ago

#256 only shows "?" replacing "+" for single-keyword matching, but I am constructing a single keyword from more than one word, joined by an optional connecting word. That means I cannot put "?" between (紅色) and (椅子), as in '+ (紅色)?(椅子)'.

My question is close to #254, but I am not talking about character detection for Japanese words. I want to use "*" as an optional match, just like the second "要" in "要出国要留学" described in #254.

Currently I work around it by changing brain.coffee line 623, which changes "*" from at least one character (.+?) to zero or more (.*?). But this affects other scripts that use <star> and need a value for the reply. So I want to find a replacement for "*" that supports zero occurrences.

regexp = regexp.replace(/\*/g, "(.*?)") 
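
To make the trade-off concrete, here is a minimal standalone sketch of the two wildcard translations. `triggerToRegExp` is a hypothetical helper written for illustration, not the actual code in rivescript-js, and it assumes the trigger contains only literal text and "*".

```javascript
// Hypothetical helper: turn a trigger containing "*" into an anchored RegExp.
// `wildcard` is the regexp fragment substituted for each "*".
function triggerToRegExp(trigger, wildcard) {
  // Escape regexp metacharacters except "*", which is substituted below.
  const escaped = trigger.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^" + escaped.replace(/\*/g, wildcard) + "$");
}

const strict = triggerToRegExp("紅色*椅子", "(.+?)"); // at least one character
const loose = triggerToRegExp("紅色*椅子", "(.*?)");  // zero or more characters

console.log(strict.test("紅色的椅子")); // true
console.log(strict.test("紅色椅子"));   // false
console.log(loose.test("紅色椅子"));    // true
console.log("紅色椅子".match(loose)[1] === ""); // true: the captured value is empty
```

The last line shows the side effect described above: with (.*?) the capture can be an empty string, so a reply that relies on <star> may have nothing to insert.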

kirsle commented 6 years ago

Wildcards, JavaScript and Unicode don't get along very well, as you can see from all the other issues linked above. The ? command was added recently to try and work around the use case of people wanting Unicode characters like Chinese surrounded by optional wildcards like [*], whose regexps I couldn't fix directly.

This is actually the first request I've seen for an optional wildcard between Unicode words, but the same problems apply... unfortunately though the ? command isn't adequate for this task.

If you want to do some experimenting, an idea to resolve this may be to further improve the ? command to do some conversion of wildcard symbols inside the trigger. The ? command is implemented around here, so you can see that it turns the trigger text into 5 different variations of the same trigger with different types of wildcards.

My idea is to have it search out any * wildcard in the trigger, and convert it into variations that include different kinds of wildcards or none at all. So example...

// the original trigger in the source file
? 紅色*椅子
- Red chair.

// already: the ? command converts the trigger into
// these 5 variations
+ 紅色*椅子
+ [*]紅色*椅子[*]
+ *紅色*椅子*
+ [*]紅色*椅子*
+ *紅色*椅子[*]

// proposed new feature: do those same 5 variations, but
// with the internal wildcards also varied:
//     *
//     [*]
//     (missing)
+ 紅色*椅子
+ 紅色[*]椅子
+ 紅色椅子
// and the various keyword suffix/prefix wildcards from above
+ [*]紅色[*]椅子[*]
+ [*]紅色椅子[*]
+ [*]紅色[*]椅子*
+ [*]紅色椅子*
// etc.

One trigger will end up with a lot of variations but these can be calculated on the fly. Not sure how efficient the solution is though. :smiley:
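
A rough sketch of how those variations could be calculated is below. `expandTrigger`, `EDGES` and `INNER` are hypothetical names, not part of rivescript-js, and the sketch assumes every "*" in the trigger is an internal wildcard and that the trigger contains no existing optionals.

```javascript
// Hypothetical helper, not part of rivescript-js.
// The five prefix/suffix pairs listed above for the existing "?" command.
const EDGES = [["", ""], ["[*]", "[*]"], ["*", "*"], ["[*]", "*"], ["*", "[*]"]];
// The proposed forms for each internal wildcard: required, optional, missing.
const INNER = ["*", "[*]", ""];

function expandTrigger(trigger) {
  const parts = trigger.split("*"); // literal text between the wildcards
  // Build every combination of internal wildcard forms.
  let bodies = [parts[0]];
  for (const part of parts.slice(1)) {
    bodies = bodies.flatMap((body) => INNER.map((w) => body + w + part));
  }
  // Wrap each body in every prefix/suffix pair.
  return bodies.flatMap((body) => EDGES.map(([pre, post]) => pre + body + post));
}

const variations = expandTrigger("紅色*椅子");
console.log(variations.length); // 15 (3 internal forms x 5 edge pairs)
console.log(variations.includes("[*]紅色椅子*")); // true
```

Under these assumptions a trigger with n internal wildcards expands to 5 × 3^n variations, which is where the efficiency concern comes from.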

dcsan commented 6 years ago

I'm not quite sure what ? is for. is it for a normal regex? or some other variation of such? we're using a much more basic trigger system now in our system where we just have a plain regex that gets searched using /gmi flags.

the initial simplified regexes worked kind of OK for content people but I'm not sure having some other variation for more complex UTF8 phrases, vs real regexes with all their tooling, is going to be a win.

for example you can use you said (?<something>.*) named capture groups and various other insanely gnarly but documented things I would not want to duplicate
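
As a small illustration of the named capture group mentioned here, in plain JavaScript (ES2018 or later). The input sentence is made up, and only the i flag is used so that `exec` does not carry `lastIndex` state between calls.

```javascript
// Named capture group: the matched text is available under match.groups.
const pattern = /you said (?<something>.*)/i;
const match = pattern.exec("You said hello there");

console.log(match.groups.something); // "hello there"
```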

kirsle commented 6 years ago

So the ? command was added to work around limitations in JavaScript's regexp engine (along with the simplified regexp system RiveScript uses to make the triggers more user-friendly), with regards to Unicode and some regexp metacharacters.

Here's the part of code that handles optionals.

Optionals have the strange requirement that: the user's message can either provide the word(s) from the optional, in which case it must be surrounded by spaces (or word boundaries, in the case that the optional is at the beginning or end of the trigger), or the optional must be entirely missing, but there has to be at least a space or word boundary in its place.

For the trigger + what is your [phone] number it means you can say "what is your phone number" or "what is your number", but you can not say "what is yournumber" (notice no space where the optional word was, and the other words running together). This was actually a problem in the beginning (see #48), and the fix was to add the \b|\s to require a word boundary or space character to handle the case where the user didn't include the optional word.

The problem then is that the \b word-boundary assertion only recognizes ASCII word characters (A-Za-z0-9 and underscore) and not any Unicode outside that range. That makes the optionals not work at all with Chinese and other languages. I couldn't find a way to handle the optionals regexp to work with Unicode in a way that doesn't also reintroduce bugs like #48.
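
The \b limitation can be checked directly in plain JavaScript. This is only a demonstration of the regexp engine's behavior, not the RiveScript optionals code, and the variable names are made up for the example.

```javascript
// \b is defined in terms of \w, and \w only covers ASCII letters, digits and "_".
const asciiBoundary = /\bhello\b/.test("say hello there");
const chineseBoundary = /\b你好\b/.test("你好");
const chineseIsWordChar = /\w/.test("你");
// The u flag alone does not make \b Unicode-aware.
const chineseBoundaryUnicodeFlag = /\b你好\b/u.test("你好");

console.log(asciiBoundary);              // true
console.log(chineseBoundary);            // false
console.log(chineseIsWordChar);          // false
console.log(chineseBoundaryUnicodeFlag); // false
```

Because "你" and "好" are not word characters as far as \w is concerned, there is no word/non-word transition at either end of "你好", so \b never matches there.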

Since it was a really common use case to want "keyword triggers" like + [*] 你好 [*], the ? command came in to work around that by translating the original trigger text into ALL the variations of wildcards to make sure it matches in all cases.


So I played with the idea of allowing users to provide raw regular expressions without any of the magic or simplified syntax that RiveScript's triggers support: https://github.com/aichaos/rivescript-wd/issues/6

It became hard though to handle the sorting of the regexps. With RiveScript's trigger system, it's able to count the number of normal words, wildcards, optionals, alternative sets and so-on and organize the triggers in a "most specific first" ordering, so that the longest trigger with the most words and fewest wildcards will be tested first when looking up a reply, so that less specific triggers don't overshadow more specific ones and make matching a nightmare.

(For a quick example of ordering triggers, if you were writing a completely custom bot doing regexp tests yourself you might have code like this:)

# forgive the Perl but it's fast to write

if ($message =~ /^who is (.+?)$/) {
    # trigger: who is *
    $reply = "I've never heard of $1 before.";
}
elsif ($message =~ /^who is linus torvalds$/) {
    $reply = "He is the father of Linux.";
}

Here a less specific trigger of who is * appears before a more specific one of who is linus torvalds -- if the bot is organized in this way, it's impossible to ask about Linus Torvalds because the wildcard trigger was tested first and gives its response. RiveScript's triggers would automatically arrange these the opposite way so the wildcard trigger is tested after and it all works out how the bot author expected. (And in case it didn't, there's the {weight} tag to fine-tune the priority of a trigger).
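
A deliberately simplified sketch of a "most specific first" ordering follows. It is not RiveScript's actual sorting algorithm: `describe` and `sortTriggers` are hypothetical helpers that only count whitespace-separated words, "*" wildcards, and an optional {weight=N} tag.

```javascript
// Simplified "most specific first" ordering; not RiveScript's actual sort.
function describe(trigger) {
  // Optional {weight=N} tag; defaults to 0 when absent.
  const weightMatch = trigger.match(/\{weight=(\d+)\}/);
  const weight = weightMatch ? Number(weightMatch[1]) : 0;
  const text = trigger.replace(/\{weight=\d+\}/, "").trim();
  const tokens = text.split(/\s+/);
  const wildcards = tokens.filter((t) => t === "*").length;
  return { trigger, weight, words: tokens.length - wildcards, wildcards };
}

function sortTriggers(triggers) {
  return triggers
    .map(describe)
    .sort((a, b) =>
      b.weight - a.weight ||     // higher weight first
      b.words - a.words ||       // then more literal words
      a.wildcards - b.wildcards  // then fewer wildcards
    )
    .map((d) => d.trigger);
}

console.log(sortTriggers(["who is *", "who is linus torvalds", "*"]));
// [ 'who is linus torvalds', 'who is *', '*' ]
```

With this ordering the specific Linus Torvalds trigger is tested before the wildcard one, regardless of the order in which they were written.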

Allowing users to write raw regular expressions though complicates the sorting -- regexps tend to be shorter and more concise than full sentences in some cases. Since you can't easily count the words and wildcards in a raw regexp it's hard to rank them in the same way as the normal triggers; you can either sort them by pure character length (in which case many of them will appear much further down the sort list than you'd want) or try and come up with something more complicated.

So the ? command was picked as an easier alternative to raw regexps even though it still has its problems.

dcsan commented 5 years ago

if the sorting is the only problem, then let the regexps match just in the order that the user provides them in the script? That seems an easy solution.

if you could add a weight option to regexp too then for example with matches across different files we could control the ordering.
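
A minimal sketch of that suggestion, assuming a hypothetical rule list where each entry has a `pattern`, a `reply` function and an optional `weight`. None of this is an existing RiveScript API; it only illustrates script-order matching with a weight override.

```javascript
// Hypothetical rule list; `weight` is optional and defaults to 0.
const rules = [
  { pattern: /^who is (.+?)$/i, reply: (m) => `I've never heard of ${m[1]} before.` },
  { pattern: /^who is linus torvalds$/i, reply: () => "He is the father of Linux.", weight: 10 },
];

function getReply(message, ruleList) {
  // Array.prototype.sort is stable (ES2019), so rules with equal weight
  // keep the order in which they were written.
  const ordered = [...ruleList].sort((a, b) => (b.weight || 0) - (a.weight || 0));
  for (const rule of ordered) {
    const m = message.match(rule.pattern);
    if (m) return rule.reply(m);
  }
  return null; // no rule matched
}

console.log(getReply("who is linus torvalds", rules)); // "He is the father of Linux."
console.log(getReply("who is ada lovelace", rules));   // "I've never heard of ada lovelace before."
```

Without the weight on the second rule, the wildcard rule written first would answer both questions, which is the overshadowing problem described earlier in the thread.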