SEPIA-Framework / sepia-docs

Documentation and Wiki for SEPIA. Please post your questions and bug-reports here in the issues section! Thank you :-)
https://sepia-framework.github.io/

Support for more languages #3

Closed mavrosxristoforos closed 5 years ago

mavrosxristoforos commented 5 years ago

Is your feature request related to a problem? Please describe.
There is currently no obvious way to support more languages. SEPIA only works with English and German.

Describe the solution you'd like
I would like it to support more languages.

Describe alternatives you've considered
A good starting point would be to write some documentation about how to contribute and extend language support for more languages.

Additional context
I would like to help with support for the Greek language.

fquirin commented 5 years ago

Hi Christopher,

building a (relatively) easy way to integrate more languages is definitely on the to-do list, but it is a very complex topic. At its core SEPIA supports more than just English and German, as you can see for example when you check the lists of "hard-coded" answers and commands (https://github.com/SEPIA-Framework/sepia-assist-server/tree/master/Xtensions/Assistant/commands), but there is basically no NLU code for these languages yet, so they are disabled at the moment. The first step would be to translate these lists (especially the answers; Greek is the 'el' list). Besides that, there is an enormous number of individual rules (mostly regular expressions) that help to interpret sentences. Currently I'm working on integrating Apache OpenNLP to make intent recognition and named-entity recognition (NER) more flexible so we can get rid of most of the rules. I'm not sure how well this will work for Greek, since some languages rely more on part-of-word combinations than others, e.g. Turkish (you can basically build a sentence with one word). I built a PoC a while ago to test this: https://github.com/fquirin/java-nlu-tools It would be interesting to get some feedback on how well that works for Greek ;-)

Expect to see a new SDK soon that will give you more info and a better overview of what is necessary to convert a service to Greek.

mavrosxristoforos commented 5 years ago

I cloned the assist-server and started translating the existing answers, commands and teachIt files. Waiting for the next SDK!

mavrosxristoforos commented 5 years ago

I translated the /data test files from java-nlu-tools to Greek and ran the tests. My translation approach was making the command natural to Greek, but maintaining the English wording as much as possible. Here are the results:

MalletNerDemo
Good: 7, bad: 9, prec.: 0.4375
Took: 1090ms

MalletNlpIntentDemo
Good: 15, bad: 2, prec.: 0.8823529411764706
Certainty average (good): 0.7231281416655222
Took: 606ms

OpenNlpNerDemo
Good: 9, bad: 7, prec.: 0.5625
Took: 1504ms

OpenNlpIntentDemo
Good: 13, bad: 4, prec.: 0.7647058823529411
Certainty average (good): 0.7757466598149811
Took: 783ms
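
For anyone comparing these numbers: the reported "prec." values are consistent with good / (good + bad), i.e. the share of test sentences handled correctly. The small Python sketch below recomputes them from the counts above. The `RESULTS` table and the `precision` helper are names made up for this illustration; they are not part of java-nlu-tools.

```python
# Good/bad counts copied from the Greek test runs reported above.
RESULTS = {
    "MalletNerDemo": (7, 9),
    "MalletNlpIntentDemo": (15, 2),
    "OpenNlpNerDemo": (9, 7),
    "OpenNlpIntentDemo": (13, 4),
}


def precision(good, bad):
    """Share of correctly handled test sentences: good / (good + bad)."""
    return good / (good + bad)


for name, (good, bad) in RESULTS.items():
    print(f"{name}: {precision(good, bad):.4f}")
```

Both NER runs come out well below the intent runs, and each demo only has 16 or 17 test sentences, so a single sentence shifts the score by about six percentage points.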

fquirin commented 5 years ago

Thanks a lot for the info and work! It looks like the system does not perform as well as it does with German and English :-( Could you maybe explain a bit how Greek builds sentences like "Show me the way from A to B" or "Switch on the lights in the living-room"? I once wanted to build a test for Turkish, but due to the totally different grammar it requires some new techniques that can tokenize sentences differently. We can always fall back to regular expressions in that case, which will also be part of the SDK, but I'd prefer to avoid that.

mavrosxristoforos commented 5 years ago

Except for a few idiomatic expressions that you have written in the test files, the rest can be directly translated to Greek, word by word, although it may not feel very natural. That's why I included a few more natural expressions, but I guess it needed more tests. Here's an example: "Show me the way from A to B" is "Δείξε μου τον δρόμο από το Α στο Β". The only reason why it may need more training is that the articles are gender specific: "Show me the way from [the: gender specific article] A [to: gender specific article] B". "Switch on the lights in the living-room": "Άναψε τα φώτα στο σαλόνι". It has very good potential, with only a few more difficulties than English.

mavrosxristoforos commented 5 years ago

I got the chance now and have a few spare minutes, so here are a few more examples: "Show me the way from A to B" is "Δείξε μου τον δρόμο από το Α στο Β", which does sound natural in Greek, but it may not be what a Greek person would say to a GPS system. They would probably say something like: "Route to B" or "Way to B" or "Directions to B" ("Διαδρομή προς Β" or "Δρόμος για Β" or "Οδηγίες για Β"). Including the LOC_START, that would be "Οδηγίες από Αθήνα για Θεσσαλονίκη" or "Οδηγίες από Αθήνα προς Θεσσαλονίκη" or "Οδηγίες από την Αθήνα προς τη Θεσσαλονίκη" (notice that the gender-specific article "την" may or may not have a trailing n (Greek: ν), depending on the first letter of the next word). So the words are not usually merged in common sentences, but especially for route directions there may be a lot of variations of the same expression. A larger training vocabulary would do the trick.
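
To make the variation concrete, here is a rough Python sketch of the regular-expression fallback applied to the phrasings listed above. It is only an illustration and not SEPIA code: the `ROUTE` pattern, the `parse_route` helper and the `LOC_END` key are assumptions for this example (only LOC_START is named in this thread), and it only handles one-word place names.

```python
import re

# Optional article before a place name: τον, την, τη or το.
ARTICLE = r"(?:τ(?:ον|ην|η|ο)\s+)?"

# Hypothetical pattern covering the route phrasings quoted in this thread.
ROUTE = re.compile(
    r"^(?:Οδηγίες|Διαδρομή|Δρόμος)"
    r"(?:\s+από\s+" + ARTICLE + r"(?P<start>\S+))?"
    r"\s+(?:για|προς)\s+" + ARTICLE + r"(?P<end>\S+)$"
)


def parse_route(text):
    """Return the start and end location, or None if the sentence does not match."""
    match = ROUTE.match(text.strip())
    if match is None:
        return None
    return {"LOC_START": match.group("start"), "LOC_END": match.group("end")}


print(parse_route("Οδηγίες από την Αθήνα προς τη Θεσσαλονίκη"))
print(parse_route("Οδηγίες από Αθήνα για Θεσσαλονίκη"))
print(parse_route("Διαδρομή προς Β"))
```

The articles are simply made optional here, so the pattern accepts "την" and "τη" alike without checking which one is grammatically correct before the following word.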

fquirin commented 5 years ago

Hi Christopher,

sorry for the late reply, I'm a bit in Xmas stress =) From what you've explained, it sounds like Greek has the potential to work well with SEPIA. Most of the issues you've described are actually also present in German (to be honest, German is a pain in the ass compared to English regarding the NLU =) ). I think what we should try is to boost the accuracy by adding more examples. In theory all the machine learning methods require a much larger corpus than what I've used in my examples, but sometimes it seems to work well with a smaller set too.

I've made some progress with the SDK. Although the first version will not support OpenNLP yet (it requires some work on the NLU-chain of the Assist-server) it will be able to define 3 different levels of command definitions: direct-match, match-with-variable and regular-expression-match:

Direct-match is basically when a given sentence is known by the system (like the sentences given in the commands/...txt files).

Match-with-variable is similar but lets you define certain words as variables. This is a bit tricky to use, as it can interfere a lot with the default NLU, e.g. if you define a command like "show me <item>", which could fit basically every service (show me news/weather/directions etc.). But when used wisely it can be very useful.

Regular-expression-match is a very powerful way to recognize intents and extract parameters. It is similar to match-with-variable but much more flexible. On the other hand, it does not give you back the parameters (variables/named-entities) automatically, but it can be combined very well with the given ones.
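
To illustrate the three levels, here is a minimal Python sketch. It is not the SDK API (which is not shown in this thread): the tables, the command labels ("lights", "search", "directions"), the helper names and the order in which the levels are tried are all assumptions made for the example.

```python
import re

# Hypothetical command tables; sentences and labels are illustrative only.
DIRECT = {"switch on the lights in the living-room": "lights"}
VARIABLE = {"show me <item>": "search"}
REGEX = {r"^(?:show me the )?way from (?P<start>.+) to (?P<end>.+)$": "directions"}


def compile_template(template):
    """Turn 'show me <item>' into a regex with one named group per <variable>."""
    pieces = []
    for part in re.split(r"(<\w+>)", template):
        if part.startswith("<") and part.endswith(">"):
            pieces.append(f"(?P<{part[1:-1]}>.+)")
        else:
            pieces.append(re.escape(part))
    return re.compile("^" + "".join(pieces) + "$")


def interpret(text):
    """Return (command, parameters), or (None, {}) if no level matches."""
    sentence = text.strip().lower()
    # Level 1: the whole sentence is known.
    if sentence in DIRECT:
        return DIRECT[sentence], {}
    # Level 2: known sentence with variable slots.
    for template, command in VARIABLE.items():
        match = compile_template(template).match(sentence)
        if match:
            return command, match.groupdict()
    # Level 3: free-form regular expressions.
    for pattern, command in REGEX.items():
        match = re.match(pattern, sentence)
        if match:
            return command, match.groupdict()
    return None, {}


print(interpret("show me news"))
print(interpret("way from athens to thessaloniki"))
print(interpret("show me the way from a to b"))
```

The last call shows the pitfall mentioned above: with this ordering, "show me the way from a to b" is captured by the broad "show me <item>" template and never reaches the directions pattern.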

I hope this will give us enough options to play around a bit with Greek :-) Expect the first test version in the next few days ^^.

mavrosxristoforos commented 5 years ago

Alright, sounds like a lot is coming soon. I will close this issue and see how it goes. Thank you for your time.