Closed Lindafr closed 4 years ago
This definitely fits the goal of this package. It's now on the TODO.
Are you interested in collaborating on this one?
Hi! Collaboration would be interesting, but I doubt I have enough time. I tried to hack Rasa and add Stanza's Estonian lemmatization straight into SpacyTokenizer, but have failed so far. I guess those who are more familiar with the code will find the task easier.
How about this: around the time that I think I have something, could you have a peek and give a review? The main thing I'd like a second pair of eyes on is the stanza package, because I've never used it.
It feels like the components could be split up though.
Yes, I can help with reviews! I can also try to answer any questions you have about stanza, since I have used it more.
Estonian works with the whitespace tokeniser; it is commonly used.
Entity detection with stanza is quite alright and the best available resource, in the sense that it can easily be used in other applications as well. Several Estonian language technologies used by companies in Estonia have stanza under the hood.
Lemmatisation is crucial for me and it is the whole reason I started commenting here. Estonian (like many other languages, including Finnish and Hungarian) is a mostly agglutinative language. This means that a noun can have 29±1 different forms and a verb ca 93 different forms (sic!). If I don't lemmatise the text, Rasa treats all those different forms as separate words, which might increase the amount of data needed and decrease Rasa's efficiency. Since we don't have any good lemmatised word-vector models trained on big data available, I guessed it would be worth trying to take the FastText (or BytePair) non-lemmatised text embeddings as one feature and, during tokenisation, replace each word with its lemma, so that further down the pipeline Rasa has the word's non-lemmatised embedding and sees the lemma.
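The embedding-plus-lemma idea above can be sketched with a toy example. Everything here (the two-entry lemma table, the fake two-dimensional vectors) is invented purely for illustration; a real setup would use stanza for the lemmas and FastText/BytePair for the vectors:

```python
# Toy illustration of the proposed trick: look up the embedding of the
# *surface* form first, then replace the token text with its lemma so
# that downstream components see the lemma.

# A tiny hand-made lemma table (a real system would use stanza).
LEMMAS = {"ütlen": "ütlema", "ütlevad": "ütlema"}

# Pretend word vectors keyed by the surface form (e.g. FastText).
EMBEDDINGS = {"ütlen": [0.1, 0.2], "ütlevad": [0.1, 0.3]}

def process(tokens):
    """Return (lemma, surface-form embedding) pairs for each token."""
    return [(LEMMAS.get(t, t), EMBEDDINGS.get(t)) for t in tokens]

result = process(["ütlen", "ütlevad"])
# Both forms collapse to the same lemma while keeping distinct embeddings.
```

The point is that the word-level features are shared across inflections while the dense features still carry the surface-form information.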
Interesting!
Out of curiosity: considering that our pipeline has a CountVectorizer that can also use 2/3/4-grams on a character level, I wonder what can be gained by adding lemmatisation. In the case of ütlevad and ütlen you might still have ütl as a common feature, no?
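As a quick stdlib-only illustration of why the two verb forms share character-level features (the `char_ngrams` helper exists only for this example):

```python
def char_ngrams(word, n):
    """Return the set of all character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

# The two inflected forms of "ütlema" share their 3-gram prefix features.
shared = char_ngrams("ütlevad", 3) & char_ngrams("ütlen", 3)
# → {"ütl", "tle"}
```

So even without lemmatisation, a char-level featurizer gives the classifier some signal that the two forms are related.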
I've just checked with the research team; we actually offer lemmatisation in our CountVectorsFeaturizer if you use a spaCy model. This suggests another route that might also be useful: how can we easily create spaCy-compatible models that rely on stanza as a backend?
Regarding the 2/3/4-grams question:
Estonian is also a somewhat fusional language, which means we have verbs whose root also changes ("hüpa-ta", "hüppa-n" or "võidel-da", "võitle-n" or "tõmba-n", "tõmma-ta"). We also have tons of compound words. Lemmatisation helps to reduce the set of words or char-grams. It might not be so helpful with the 'char' analyzer, but it helps a lot with the 'word' analyzer. Experience so far has shown that the char-level CountVectorsFeaturizer does not give very good results in Estonian, and I planned to start testing Rasa with the word-level CountVectorsFeaturizer.
Regarding your last question, I don't know. I tried to hack SpacyTokenizer so that the function tokenize(self, message: Message, attribute: Text) -> List[Token] returns Token() objects whose content is given by stanza instead. I ended up with another error down the pipeline ("ValueError: Sequence dimensions for sparse and dense features don't coincide in...").
The Rasa word-level CountVectorizer uses the .lemma_ if SpacyTokenizer is present. This suggests that if you have a spaCy model with a custom .lemma_ implementation, it should work. I wouldn't know how easy/hard it is to implement this, but it seems worth checking out. I'll keep you posted if I learn anything.
@Lindafr I'm wondering what the best approach is here. I can attempt to get stanza into something that is spaCy-compatible, but with the advent of spaCy 3.0, as well as a lot of details in the "getting it right" department, I'm currently leaning towards just making a Rasa component.
Would you agree that the POS tags are probably the most important feature to get started with? I should be able to add these as sparse features for the machine learning pipeline.
Hi @koaning , a Rasa component would be easy for the end user. I, myself, was thinking along the lines of just substituting spaCy .lemma_ values with Stanza values (more like a hack than a beautiful pipeline solution).
For me, the most important feature would be lemmas. POS-es (and NERs?) come after that.
I've talked to the research team about lemmas and it seems like we've never really supported them directly, only indirectly for spaCy pipelines. There's interest in exploring it further, but it may take time to reach a consensus on the best approach. Until then, I'll keep in mind that when I've got time to work on this, POS is a good candidate to start out with.
I will be starting with a tokenizer first. The reason is that internally, if you add a lemma property to a token, the CountVectorsFeaturizer is able to pick up the lemma instead of the word.
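The mechanism can be sketched roughly like this. Note that `Token` and `featurize` below are simplified stand-ins for illustration, not Rasa's actual classes:

```python
class Token:
    """Simplified stand-in for a tokenizer output with an optional lemma."""
    def __init__(self, text, lemma=None):
        self.text = text
        self.lemma = lemma

def featurize(tokens):
    """Prefer the lemma over the raw text when it is present,
    mimicking how a count featurizer could pick up lemmas."""
    return [t.lemma if t.lemma is not None else t.text for t in tokens]

tokens = [Token("greetings", lemma="greeting"), Token("and")]
# featurize(tokens) → ["greeting", "and"]
```

The tokenizer attaches the lemma; downstream components that know about the property can use it transparently.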
@Lindafr I have a first implementation up and running and you should be able to play with it. I've only tried it out with English so far on my local machine, but you should be able to use the implementation on the linked PR for any supported Stanza language.
If you've got the time and you'd like to play with it, you should be able to download the stanzatokenizer.py file from the PR locally and open it in Jupyter. You should be able to play around with it using code similar to:
from stanzatokenizer import StanzaTokenizer
from rasa.nlu.training_data import Message
# You can change the language setting here
tok = StanzaTokenizer(component_config={"lang": "en"})
# This is a Rasa internal thing, you need to wrap text with an object
m = Message("i am running and giving many greetings")
tok.process(m)
# You should now be able to check the properties of the message.
[t.text for t in m.as_dict()['tokens']]
[t.data.get('pos', '') for t in m.as_dict()['tokens']]
[t.lemma for t in m.as_dict()['tokens']]
For example, the last three lists on my machine were:
['i', 'am', 'running', 'and', 'giving', 'many', 'greetings', '__CLS__']
['PRON', 'AUX', 'VERB', 'CCONJ', 'VERB', 'ADJ', 'NOUN', '']
['i', 'be', 'run', 'and', 'give', 'many', 'greeting', '__CLS__']
The __CLS__ token is also a Rasa-internal thing; you should see it make an appearance. This is the token we use internally to represent the entire message. If you can confirm that the results you see are sensible, then I can move on to the next phase of the implementation :)
Hi @koaning , I just found this. I didn't know it existed before, but it seems that integrating stanza into spaCy may be even easier, and you won't have to invent a whole new pipeline.
I'll take a look at the stanzatokenizer.py first thing tomorrow morning and then I'll investigate the spacy_stanza module.
If either works, let me know :)
Having glanced at the docs, it seems like the spacy-stanza plugin would be the preferable route. It's even hosted by Explosion. Let me know if you're having trouble linking it with Rasa! The docs suggest you should be able to just save the model to disk via
nlp.to_disk("./stanza-spacy-model")
To properly link it you might need to do the same steps as I'm taking here.
Hi @koaning ,
Your code works great!
I got an error with spacy-stanza while following its tutorial, so I haven't had the chance to try linking it with Rasa yet, because I couldn't even get it to work outside Rasa (nlp = spacy.load("./stanza-spacy-model", snlp=snlp) doesn't work after nlp.to_disk("./stanza-spacy-model")). I'm investigating it atm, but your code gives the right results for Estonian.
Thanks for letting me know. I'm having a lunch break now but I might have some time to have a look at the spaCy bindings for stanza.
Just to check, do you also have a Rasa project? You should now also be able to use that tokenizer in config.yml. The following configuration should automatically pick up the pos and the lemma from the pipeline. The lemma is a bit of a hidden feature of the CountVectorsFeaturizer. This configuration assumes that you've placed the stanzatokenizer.py file in the root of your Rasa project.
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
If you run it with stanzatokenizer.StanzaTokenizer vs. WhitespaceTokenizer you should get different results in your NLU scores.
I've just tried using spaCy directly and ran into some trouble. There's a caveat mentioned here, and I'm getting this error when I try to package it all up:
TypeError: __init__() missing 1 required positional argument: 'snlp'
I fear that we'd need to construct a special binding around this spacy-stanza package to get it to work for the Rasa use case. It's doable, but it might make more sense to build the stanza feature directly for Rasa.
Did you try this on English? With Estonian I get the language error that should be solved ("But because this package exposes a spacy_languages entry point in its setup.py that points to StanzaLanguage, spaCy knows how to initialize it.").
import spacy
import stanza
from spacy_stanza import StanzaLanguage
#stanza.download("et")
snlp = stanza.Pipeline(lang="et")
nlp = StanzaLanguage(snlp)
nlp.to_disk("./stanza-spacy-model")
nlp = spacy.load("./stanza-spacy-model", snlp=snlp)
My hope was that this would remove the need to construct a special binding in Rasa, but if it doesn't work then direct stanza support in Rasa would indeed be more sensible.
@Lindafr I've tried doing it with English, yes, but the current Rasa support for spaCy does not assume that we need to pass snlp when we call spacy.load. I think that's what is causing some bugs on my side now. It might be that I'm missing a detail, but it seems that to support stanza via this route I'll need to implement a component for spacy-stanza to handle this.
If I end up building a component here for stanza, then I'd prefer to host a direct binding to Rasa. Fewer things to maintain that way.
I agree. I'll investigate spacy-stanza a bit more, but it probably isn't suitable in this case.
@Lindafr out of curiosity, did you notice an improvement with stanzatokenizer.py in Rasa?
@Lindafr not the biggest rush, but did you try the tool on a Rasa project by any chance? If you've experienced its merits then I might be able to wrap up this feature this week.
I get this error when running a test project with stanzatokenizer.py.
The config.yml:
language: "et"
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"
@Lindafr It was indeed the same issue. It should now be fixed; the only caveat is that you now need to supply the path to your stanza installation manually. That means you might need to do something like:
language: en
pipeline:
- name: rasa_nlu_examples.tokenizers.StanzaTokenizer
  lang: "en"
  cache_dir: "tests/data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
If you are running this locally, the stanza files are usually located in ~/stanza_resources/. Also, just to check, which platform are you using? Windows/MacOS/Ubuntu?
Ubuntu :)
Just to make sure I did everything accordingly, I'll describe the testing conditions here. I used the stanzatokenizer.py from here.
The stanza_test folder looks something like this:
|-PPA_stanza_test/
| | stanzatokenizer.py
| | config.yml
| | domain.yml
| | endpoints.yml
| | credentials.yml
| | actions.py
| |-tests/
| |-results/
| |-data/
| | |-stanza/
| | | |-et/
| | | | |-depparse
| | | | |-pos
| | | | |-tokenize
| | | | |-lemma
| | | | |-pretrain
| |-models/
then with config.yml
like this:
language: et
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
My WhitespaceTokenizer is in another folder and its config.yml looks like this:
language: "et"
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"
I trained new models first and then tested them. I tried to create some harder user messages that won't result in high confidence, to better review the confidence differences. Here are the results (for the sake of brevity, I excluded the messages and intent names themselves):
Ideal case | WhiteSpaceTokenizer (WST) | StanzaTokenizer (ST) | Which one is more confident |
---|---|---|---|
Received user message 'abc' with intent '{'name': 'X', 'confidence':1.0} | Received user message 'abc' with intent '{'name': 'X', 'confidence': 0.98} | Received user message 'abc' with intent '{'name': 'Y', 'confidence': 0.136} | WST got it right and is sure of it, stanza got it wrong and is not sure of it. |
Received user message 'abcd' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcd' with intent '{'name': 'Y', 'confidence': 0.656}' | Received user message 'abcd' with intent '{'name': 'X', 'confidence': 0.148} | WST is quite confident in a wrong intent, while ST is not confident at all in the right intent. |
Received user message 'abcde' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcde' with intent '{'name': 'Y', 'confidence': 0.367}' | Received user message ''abcde' with intent '{'name': 'Z', 'confidence': 0.122}' | Seems to be a bad example- both configs got it wrong. ST is a bit less confident, which is good. |
Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdef' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and confident about it, ST is wrong and not confident about it |
Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdefg' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and confident about it, ST is wrong and not confident about it |
Received user message 'a1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a1' with intent '{'name': 'X', 'confidence': 0.868}' | Received user message 'a1' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and quite confident about it, ST is wrong and not confident about it |
Received user message 'a12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a12' with intent '{'name': 'X', 'confidence': 0.827}' | Received user message 'a12' with intent '{'name': 'Y', 'confidence': 0.136} | WST is right and quite confident about it, ST is wrong and not confident about it |
Received user message 'a123' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a123' with intent '{'name': 'Y', 'confidence': 0.83}' | Received user message 'a123' with intent '{'name': 'Z', 'confidence': 0.12} | WST is wrong and quite confident about it, stanza is wrong and not at all confident about it. |
Received user message 'b1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b1' with intent '{'name': 'Y', 'confidence': 0.666}' | Received user message 'b1' with intent '{'name': 'Z', 'confidence': 0.115} | WST is wrong and a bit confident about it, stanza is wrong and not at all confident about it. |
Received user message 'b12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b12' with intent '{'name': 'X', 'confidence': 0.998}' | Received user message 'b12' with intent '{'name': 'Z', 'confidence': 0.137}' | WST is right and confident about it, ST is wrong and not confident about it |
As you can see, the pipeline with the StanzaTokenizer is not confident about anything and usually gets the intent wrong. The WST, however, usually gets things right, but when it's wrong, it's wrong quite confidently.
The dataset I am working on has ca 32 intents, some of which have very similar keywords or situations (à la creating a passport application and receiving a passport).
@Lindafr this is very elaborate. Thanks for sharing!
I may have found an issue with your setup though.
language: et
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
Notice how DIET is only using 1 epoch? This is probably why the setup is underperforming. Also, you've removed the CountVectorsFeaturizer, which we need to get the lemma property. It's an undocumented feature, but the CountVectorsFeaturizer will grab the lemma if it is available on the token.
Could you try running this?
language: et
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
If you're interested in summary statistics, you can also run two config files and compare the results. Here's a snippet that might help:
rasa test nlu --config configs/config-light.yml \
--cross-validation --runs 1 --folds 2 \
--out gridresults/config-light
rasa test nlu --config configs/config-heavy.yml \
--cross-validation --runs 1 --folds 2 \
--out gridresults/config-heavy
This will grab two config files (in this case config-light.yml and config-heavy.yml) and save the summary statistics in the gridresults/config-light and gridresults/config-heavy folders. You might enjoy using rasalit for this.
Hi, @koaning ,
Based on the results I suspected some kind of mistake, which is why I wrote about the test conditions (I copy-pasted your example and didn't think it through). Today I'll run it again and then we'll have some real results. It might take a bit of time, because I first have to do some other Rasa-related things that are due tomorrow.
@Lindafr no rush! I'm just super curious of the results. 😄
Renewed results are as follows (with the config proposed earlier):
Ideal case | WhiteSpaceTokenizer (WST) | StanzaTokenizer (ST) | Which one is more confident |
---|---|---|---|
Received user message 'abc' with intent '{'name': 'X', 'confidence':1.0} | Received user message 'abc' with intent '{'name': 'X', 'confidence': 0.98} | Received user message 'abc' with intent '{'name': 'Y', 'confidence': 0.95} | WST got it right and is sure of it, stanza got it wrong but very sure of it. |
Received user message 'abcd' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcd' with intent '{'name': 'Y', 'confidence': 0.656}' | Received user message 'abcd' with intent '{'name': 'X', 'confidence': 0.955}' | WST is quite confident in a wrong intent, while ST is confident in the right intent. |
Received user message 'abcde' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcde' with intent '{'name': 'Y', 'confidence': 0.367}' | Received user message 'abcde' with intent '{'name': 'X', 'confidence': 0.675}' | WST got it wrong. ST is not too confident in the right intent. |
Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdef' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and confident about it, ST is wrong and not confident about it |
Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 0.999}' | WST is right and confident about it, ST also right and very confident about it |
Received user message 'a1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a1' with intent '{'name': 'X', 'confidence': 0.868}' | Received user message 'a1' with intent '{'name': 'X', 'confidence': 0.895}' | Both are right and quite confident about it |
Received user message 'a12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a12' with intent '{'name': 'X', 'confidence': 0.827}' | Received user message 'a12' with intent '{'name': 'X', 'confidence': 0.999}' | Both are right and quite confident about it, WST is less confident |
Received user message 'a123' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a123' with intent '{'name': 'Y', 'confidence': 0.83}' | Received user message 'a123' with intent '{'name': 'Z', 'confidence': 0.36} | WST is wrong and a bit confident about it, stanza is wrong in another way and a bit less confident about it. |
Received user message 'b1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b1' with intent '{'name': 'Y', 'confidence': 0.666}' | Received user message 'b1' with intent '{'name': 'X', 'confidence': 0.678}' | WST is wrong and a bit confident about it, stanza is right and quite confident about it. |
Received user message 'b12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b12' with intent '{'name': 'X', 'confidence': 0.998}' | Received user message 'b12' with intent '{'name': 'X', 'confidence': 0.936}' | WST is right and confident about it, ST is right and less confident about it |
As you can see, ST got only 3 utterances wrong, while WST got 4 wrong. Overall, if you exclude the first example, WST is more confident in its wrong intents, while stanza is not confident when it's wrong. Stanza seems to be better on this manual check.
Now for the other statistics.
2020-09-09 14:26:20 INFO rasa.test - Intent evaluation results
2020-09-09 14:26:20 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - test Accuracy: 0.619 (0.023)
2020-09-09 14:26:20 INFO rasa.nlu.test - test F1-score: 0.612 (0.028)
2020-09-09 14:26:20 INFO rasa.nlu.test - test Precision: 0.645 (0.042)
2020-09-09 14:43:59 INFO rasa.test - Intent evaluation results
2020-09-09 14:43:59 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - test Accuracy: 0.596 (0.016)
2020-09-09 14:43:59 INFO rasa.nlu.test - test F1-score: 0.589 (0.009)
2020-09-09 14:43:59 INFO rasa.nlu.test - test Precision: 0.615 (0.009)
One can see that WST's test accuracy, F1, and precision are actually slightly better. A bit surprising, as I thought the extra POS information would help.
With config.yml (the difference between WST and this is the first element in the pipeline):
language: "et"
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"
The results are slightly better than the recommended ST config, but still below WST:
2020-09-09 15:01:52 INFO rasa.test - Intent evaluation results
2020-09-09 15:01:52 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - test Accuracy: 0.593 (0.003)
2020-09-09 15:01:52 INFO rasa.nlu.test - test F1-score: 0.593 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - test Precision: 0.631 (0.002)
Interesting. Looking at these results I wonder about another issue. The difference between the train/test results in both the Stanza and Whitespace scenarios is huge. Especially because the training accuracy is 100%, I think we're overfitting in both examples here.
If you're up for it, could you try tuning the epochs down to maybe 50 for DIET? If you still see a big difference between train/test feel free to tune it down even further to 25. The goal is to manually stop the algorithm before it starts overfitting.
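The idea of stopping before the model overfits amounts to picking the epoch where held-out performance peaks instead of training until the train score saturates. A toy sketch with invented accuracy curves (the numbers below are made up for illustration):

```python
def best_epoch(history):
    """Pick the epoch with the highest test accuracy.

    `history` is a list of (epoch, train_acc, test_acc) tuples.
    """
    return max(history, key=lambda row: row[2])[0]

# Hypothetical sweep: train accuracy keeps climbing with more epochs,
# but test accuracy peaks and then declines as the model overfits.
history = [(25, 0.65, 0.33), (50, 0.80, 0.50), (100, 0.98, 0.54), (200, 1.00, 0.52)]
# best_epoch(history) → 100
```

In practice you would run the epoch sweep via `rasa test nlu --cross-validation` per config and compare the reported test scores.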
@Lindafr is your dataset publicly available? I might also be able to run some benchmarks on your behalf if you're interested. The research team here would love to have an Estonian dataset to benchmark our algorithms on.
Some more statistics on basically the same config.yml files.
(config.yml is as in the Extra here.)
with DIET epoch 100
2020-09-10 09:31:00 INFO rasa.test - Intent evaluation results
2020-09-10 09:31:00 INFO rasa.nlu.test - train Accuracy: 0.984 (0.005)
2020-09-10 09:31:00 INFO rasa.nlu.test - train F1-score: 0.983 (0.006)
2020-09-10 09:31:00 INFO rasa.nlu.test - train Precision: 0.984 (0.007)
2020-09-10 09:31:00 INFO rasa.nlu.test - test Accuracy: 0.536 (0.023)
2020-09-10 09:31:00 INFO rasa.nlu.test - test F1-score: 0.508 (0.022)
2020-09-10 09:31:00 INFO rasa.nlu.test - test Precision: 0.570 (0.018)
with DIET epoch 50
2020-09-10 09:10:10 INFO rasa.test - Intent evaluation results
2020-09-10 09:10:10 INFO rasa.nlu.test - train Accuracy: 0.803 (0.052)
2020-09-10 09:10:10 INFO rasa.nlu.test - train F1-score: 0.775 (0.055)
2020-09-10 09:10:10 INFO rasa.nlu.test - train Precision: 0.835 (0.009)
2020-09-10 09:10:10 INFO rasa.nlu.test - test Accuracy: 0.500 (0.034)
2020-09-10 09:10:10 INFO rasa.nlu.test - test F1-score: 0.454 (0.041)
2020-09-10 09:10:10 INFO rasa.nlu.test - test Precision: 0.570 (0.021)
with DIET epoch 25
2020-09-10 09:14:20 INFO rasa.test - Intent evaluation results
2020-09-10 09:14:20 INFO rasa.nlu.test - train Accuracy: 0.650 (0.060)
2020-09-10 09:14:20 INFO rasa.nlu.test - train F1-score: 0.614 (0.066)
2020-09-10 09:14:20 INFO rasa.nlu.test - train Precision: 0.725 (0.071)
2020-09-10 09:14:20 INFO rasa.nlu.test - test Accuracy: 0.329 (0.003)
2020-09-10 09:14:20 INFO rasa.nlu.test - test F1-score: 0.295 (0.021)
2020-09-10 09:14:20 INFO rasa.nlu.test - test Precision: 0.367 (0.065)
with DIET epoch 15
2020-09-10 09:17:48 INFO rasa.test - Intent evaluation results
2020-09-10 09:17:48 INFO rasa.nlu.test - train Accuracy: 0.482 (0.109)
2020-09-10 09:17:48 INFO rasa.nlu.test - train F1-score: 0.447 (0.123)
2020-09-10 09:17:48 INFO rasa.nlu.test - train Precision: 0.578 (0.104)
2020-09-10 09:17:48 INFO rasa.nlu.test - test Accuracy: 0.298 (0.060)
2020-09-10 09:17:48 INFO rasa.nlu.test - test F1-score: 0.238 (0.072)
2020-09-10 09:17:48 INFO rasa.nlu.test - test Precision: 0.314 (0.057)
(config.yml is as described here.)
with DIET epoch 100
2020-09-10 09:38:18 INFO rasa.test - Intent evaluation results
2020-09-10 09:38:18 INFO rasa.nlu.test - train Accuracy: 0.992 (0.008)
2020-09-10 09:38:18 INFO rasa.nlu.test - train F1-score: 0.992 (0.008)
2020-09-10 09:38:18 INFO rasa.nlu.test - train Precision: 0.993 (0.007)
2020-09-10 09:38:18 INFO rasa.nlu.test - test Accuracy: 0.562 (0.018)
2020-09-10 09:38:18 INFO rasa.nlu.test - test F1-score: 0.545 (0.020)
2020-09-10 09:38:18 INFO rasa.nlu.test - test Precision: 0.570 (0.013)
with DIET epoch 50
2020-09-10 10:11:53 INFO rasa.test - Intent evaluation results
2020-09-10 10:11:53 INFO rasa.nlu.test - train Accuracy: 0.889 (0.013)
2020-09-10 10:11:53 INFO rasa.nlu.test - train F1-score: 0.868 (0.026)
2020-09-10 10:11:53 INFO rasa.nlu.test - train Precision: 0.878 (0.046)
2020-09-10 10:11:53 INFO rasa.nlu.test - test Accuracy: 0.487 (0.016)
2020-09-10 10:11:53 INFO rasa.nlu.test - test F1-score: 0.443 (0.012)
2020-09-10 10:11:53 INFO rasa.nlu.test - test Precision: 0.491 (0.047)
with DIET epoch 25
2020-09-10 10:13:59 INFO rasa.test - Intent evaluation results
2020-09-10 10:13:59 INFO rasa.nlu.test - train Accuracy: 0.754 (0.018)
2020-09-10 10:13:59 INFO rasa.nlu.test - train F1-score: 0.733 (0.024)
2020-09-10 10:13:59 INFO rasa.nlu.test - train Precision: 0.849 (0.028)
2020-09-10 10:13:59 INFO rasa.nlu.test - test Accuracy: 0.412 (0.013)
2020-09-10 10:13:59 INFO rasa.nlu.test - test F1-score: 0.357 (0.011)
2020-09-10 10:13:59 INFO rasa.nlu.test - test Precision: 0.452 (0.027)
with DIET epoch 15
2020-09-10 10:17:19 INFO rasa.test - Intent evaluation results
2020-09-10 10:17:19 INFO rasa.nlu.test - train Accuracy: 0.490 (0.085)
2020-09-10 10:17:19 INFO rasa.nlu.test - train F1-score: 0.479 (0.052)
2020-09-10 10:17:19 INFO rasa.nlu.test - train Precision: 0.654 (0.042)
2020-09-10 10:17:19 INFO rasa.nlu.test - test Accuracy: 0.238 (0.098)
2020-09-10 10:17:19 INFO rasa.nlu.test - test F1-score: 0.205 (0.099)
2020-09-10 10:17:19 INFO rasa.nlu.test - test Precision: 0.262 (0.107)
The overfitting is still quite large. I think it's partly due to my still very small dataset: every intent (32 in total) has only 10-20 example questions, so there's not much to learn from. I have plans to make it larger with test users and by hand, but right now that's all I've got.
@koaning , the dataset is not publicly available and I cannot share it.
Do you think part of the overfitting can be due to my small dataset? I can try again later when I have more examples, but right now the results are as they are, and adding POS doesn't seem to bring any improvement (?).
@Lindafr yeah those last results seem convincing. I'll add the feature just in case so that other folks can try it out but it seems safe to say that on your use-case right now it doesn't contribute anything substantial.
Thanks a lot for the feedback though 👍! It's been very(!) helpful.
I'll merge the stanza feature today. It might prove useful to other folks and you can also check if it boosts performance once you've got more data. I'll close this issue once the feature is merged.
One last question out of curiosity, did you try the pretrained embeddings (BytePair/FastText)? If so, I'd love to hear if they were helpful.
https://github.com/RasaHQ/rasa-nlu-examples/pull/30 has been merged! Closing this issue but we can continue/pick up the conversation in this issue if there's a related discussion to be continued.
Just wanted to mention it here. Full stanza support is now coming to spaCy https://explosion.ai/blog/spacy-v3-nightly.
@koaning , are you sure? They don't mention it anywhere; they only compare Stanza's NER results with theirs in one table. Plus, they do have the spacy-stanza package, which we tried but failed to integrate with Rasa.
Btw, is there any way to use stanzatokenizer.py for getting POS and lemma and also use SpacyNLP for getting the FastText embedding? There can't be two tokenizers in the same pipeline, and SpacyNLP + stanzatokenizer.StanzaTokenizer doesn't work. I guess not, but I'm asking anyway just in case there is some obvious solution I didn't think of.
I recall reading elsewhere that a tighter integration is now possible, but I can't find the link anymore ... will look again!
If you want FastText embeddings, can't you use the ones that are available directly in this repository?
In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatisation. That is at least the case for Estonian, and should also be true for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.