Closed Lindafr closed 4 years ago
This definitely fits the goal of this package. It's now on the TODO.
Are you interested in collaborating on this one?
Hi! Collaboration would be interesting, but I doubt I have enough time. I tried to hack Rasa and add Stanza's Estonian lemmatization straight into SpacyTokenizer, but have failed so far. I guess those who are more familiar with the code will find the task easier.
How about this: around the time that I think I have something, could you have a peek and give a review? The main thing I'd like a second pair of eyes on is the stanza package, because I've never used it.
It feels like the components could be split up though.
Yes, I can help with reviews! I can also try to answer any questions you have about stanza, since I have used it more.
Estonian works with the whitespace tokeniser; it is commonly used.
Entity detection with stanza is quite alright and the best available resource, in the sense that it can easily be used in other applications as well. Several Estonian language technologies used by companies in Estonia have stanza under the hood.
Lemmatisation is crucial for me and it is the whole reason I started commenting here. Estonian (like many other languages, including Finnish and Hungarian) is a mostly agglutinative language. This means that a noun can have 29±1 different forms and a verb ca 93 different forms (sic!). If I don't lemmatise the text, Rasa treats all those different forms as separate words, which might increase the amount of data needed and decrease Rasa's efficiency. Since we don't have any good lemmatised word-vector models trained on big data available, I guessed it would be worth trying to take the FastText (or BytePair) non-lemmatised text embeddings as one feature and, during tokenisation, replace each word with its lemma, so that further down the pipeline Rasa has the word's non-lemmatised embedding and sees the lemma.
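The embedding-plus-lemma idea above can be sketched with a toy example. Everything here (the two-entry lemma table, the fake two-dimensional vectors) is invented purely for illustration; a real setup would use stanza for the lemmas and FastText/BytePair for the vectors:

```python
# Toy illustration of the proposed trick: look up the embedding of the
# *surface* form first, then replace the token text with its lemma so
# that downstream components see the lemma.

# A tiny hand-made lemma table (a real system would use stanza).
LEMMAS = {"ütlen": "ütlema", "ütlevad": "ütlema"}

# Pretend word vectors keyed by the surface form (e.g. FastText).
EMBEDDINGS = {"ütlen": [0.1, 0.2], "ütlevad": [0.1, 0.3]}

def process(tokens):
    """Return (lemma, surface-form embedding) pairs for each token."""
    return [(LEMMAS.get(t, t), EMBEDDINGS.get(t)) for t in tokens]

result = process(["ütlen", "ütlevad"])
# Both forms collapse to the same lemma while keeping distinct embeddings.
```

The point is that the word-level features are shared across inflections while the dense features still carry the surface-form information.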
Interesting!
Out of curiosity: considering that our pipeline has a CountVectorizer that can also use 2/3/4-grams on a character level, I wonder what can be gained by adding lemmatisation. In the case of ütlevad and ütlen you might still have ütl as a common feature, no?
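As a quick stdlib-only illustration of why the two verb forms share character-level features (the `char_ngrams` helper exists only for this example):

```python
def char_ngrams(word, n):
    """Return the set of all character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

# The two inflected forms of "ütlema" share their 3-gram prefix features.
shared = char_ngrams("ütlevad", 3) & char_ngrams("ütlen", 3)
# → {"ütl", "tle"}
```

So even without lemmatisation, a char-level featurizer gives the classifier some signal that the two forms are related.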
I've just checked with the research team; we actually offer lemmatisation in our CountVectorsFeaturizer if you use a spaCy model. This suggests another route that might also be useful: how can we easily create spaCy-compatible models that rely on stanza as a backend?
Regarding the 2/3/4-grams question:
Estonian is also a somewhat fusional language, which means we have verbs whose root also changes ("hüpa-ta", "hüppa-n" or "võidel-da", "võitle-n" or "tõmba-n", "tõmma-ta"). We also have tons of compound words. Lemmatisation helps to reduce the set of words or char-grams. It might not be so helpful with the 'char' analyzer, but it helps a lot with the 'word' analyzer. Experience so far has shown that the char-level CountVectorsFeaturizer does not give very good results in Estonian, and I planned to start testing Rasa with the word-level CountVectorsFeaturizer.
Regarding your last question, I don't know. I tried to hack SpacyTokenizer so that the function tokenize(self, message: Message, attribute: Text) -> List[Token] returns Token() objects whose content is given by stanza instead. I ended up with another error down the pipeline ("ValueError: Sequence dimensions for sparse and dense features don't coincide in...").
The Rasa word-level CountVectorizer uses the .lemma_ if SpacyTokenizer is present. This suggests that if you have a spaCy model with a custom .lemma_ implementation, it should work. I wouldn't know how easy/hard it is to implement this, but it seems worth checking out. I'll keep you posted if I learn anything.
@Lindafr I'm wondering what the best approach is here. I can attempt to get stanza into something that is spaCy-compatible, but with the advent of spaCy 3.0, as well as a lot of details in the "getting it right" department, I'm currently leaning towards just making a Rasa component.
Would you agree that the POS tags are probably the most important feature to get started with? I should be able to add these as sparse features for the machine learning pipeline.
Hi @koaning , a Rasa component would be easy for the end user. I, myself, was thinking along the lines of just substituting spaCy .lemma_ values with Stanza values (more like a hack than a beautiful pipeline solution).
For me, the most important feature would be lemmas. POS-es (and NERs?) come after that.
I've talked to the research team about lemmas and it seems like we've never really supported them directly, only indirectly for spaCy pipelines. There's interest in exploring it further, but it may take time to reach a consensus on the best approach. Until then, I'll keep in mind that when I've got time to work on this, POS is a good candidate to start out with.
I will be starting with a tokenizer first. The reason is that internally, if you add a lemma property to a token, the CountVectorsFeaturizer is able to pick up the lemma instead of the word.
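The mechanism can be sketched roughly like this. Note that `Token` and `featurize` below are simplified stand-ins for illustration, not Rasa's actual classes:

```python
class Token:
    """Simplified stand-in for a tokenizer output with an optional lemma."""
    def __init__(self, text, lemma=None):
        self.text = text
        self.lemma = lemma

def featurize(tokens):
    """Prefer the lemma over the raw text when it is present,
    mimicking how a count featurizer could pick up lemmas."""
    return [t.lemma if t.lemma is not None else t.text for t in tokens]

tokens = [Token("greetings", lemma="greeting"), Token("and")]
# featurize(tokens) → ["greeting", "and"]
```

The tokenizer attaches the lemma; downstream components that know about the property can use it transparently.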
@Lindafr I have a first implementation up and running and you should be able to play with it. I've only tried it out with English so far on my local machine, but you should be able to use the implementation on the linked PR for any supported Stanza language.
If you've got the time and you'd like to play with it, you should be able to download the stanzatokenizer.py file from the PR locally and open it in Jupyter. You should be able to play around with it using code similar to:
from stanzatokenizer import StanzaTokenizer
from rasa.nlu.training_data import Message
# You can change the language setting here
tok = StanzaTokenizer(component_config={"lang": "en"})
# This is a Rasa internal thing, you need to wrap text with an object
m = Message("i am running and giving many greetings")
tok.process(m)
# You should now be able to check the properties of the message.
[t.text for t in m.as_dict()['tokens']]
[t.data.get('pos', '') for t in m.as_dict()['tokens']]
[t.lemma for t in m.as_dict()['tokens']]
For example, the last three lists on my machine were:
['i', 'am', 'running', 'and', 'giving', 'many', 'greetings', '__CLS__']
['PRON', 'AUX', 'VERB', 'CCONJ', 'VERB', 'ADJ', 'NOUN', '']
['i', 'be', 'run', 'and', 'give', 'many', 'greeting', '__CLS__']
The __CLS__ token is also a Rasa-internal thing; you should see it make an appearance. This is the token we use internally to represent the entire message. If you can confirm that the results you see are sensible, then I can move on to the next phase of the implementation :)
Hi @koaning , I just found this. I didn't know it existed before, but it seems that integrating stanza into spaCy may be even easier, and you won't have to invent a whole new pipeline.
I'll take a look at the stanzatokenizer.py first thing tomorrow morning and then I'll investigate the spacy_stanza module.
If either works, let me know :)
Having glanced at the docs, it seems like the spacy-stanza plugin would be the preferable route. It's even hosted by Explosion. Let me know if you're having trouble linking it with Rasa! The docs suggest you should be able to just save the model to disk via
nlp.to_disk("./stanza-spacy-model")
To properly link it you might need to do the same steps as I'm taking here.
Hi @koaning ,
Your code works great!
I got an error with spacy-stanza while following its tutorial, so I haven't had the chance to try linking it with Rasa yet, because I couldn't even get it to work outside Rasa (nlp = spacy.load("./stanza-spacy-model", snlp=snlp) doesn't work after nlp.to_disk("./stanza-spacy-model")). I'm investigating it atm, but your code gives the right results for Estonian.
Thanks for letting me know. I'm having a lunch break now but I might have some time to have a look at the spaCy bindings for stanza.
Just to check, do you also have a Rasa project? You should now also be able to use that tokenizer in config.yml. The following configuration should automatically pick up the pos and the lemma from the pipeline. The lemma is a bit of a hidden feature of the CountVectorsFeaturizer. This configuration assumes that you've placed the stanzatokenizer.py file in the root of your Rasa project.
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
If you run it with stanzatokenizer.StanzaTokenizer vs. WhitespaceTokenizer you should get different results in your NLU scores.
I've just tried using spaCy directly and ran into some trouble. There's a caveat mentioned here, and I'm getting this error when I try to package it all up:
TypeError: __init__() missing 1 required positional argument: 'snlp'
I fear that we'd need to construct a special binding around this spacy-stanza package to get it to work for the Rasa use case. It's doable, but it might make more sense to build the stanza feature directly for Rasa.
Did you try this on English? With Estonian I get the language error that should be solved ("But because this package exposes a spacy_languages entry point in its setup.py that points to StanzaLanguage, spaCy knows how to initialize it.").
import spacy
import stanza
from spacy_stanza import StanzaLanguage
#stanza.download("et")
snlp = stanza.Pipeline(lang="et")
nlp = StanzaLanguage(snlp)
nlp.to_disk("./stanza-spacy-model")
nlp = spacy.load("./stanza-spacy-model", snlp=snlp)
My hope was that this would remove the need to construct a special binding in Rasa, but if it doesn't work then direct stanza support in Rasa would indeed be more sensible.
@Lindafr I've tried doing it with English, yes, but the current Rasa support for spaCy does not assume that we need to pass snlp when we call spacy.load. I think that's what is causing some bugs on my side now. It might be that I'm missing a detail, but it seems that to support stanza via this route I'll need to implement a component for spacy-stanza to handle this.
If I end up building a component here for stanza, then I'd prefer to host a direct binding to Rasa. Fewer things to maintain that way.
I agree. I'll investigate spacy-stanza a bit more, but it probably isn't suitable in this case.
@Lindafr out of curiosity, did you notice an improvement with stanzatokenizer.py in Rasa?
@Lindafr not the biggest rush, but did you try the tool on a Rasa project by any chance? If you've experienced its merits then I might be able to wrap up this feature this week.
I get this error when running a test project with stanzatokenizer.py.
The config.yml:
language: "et"
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"
@Lindafr It was indeed the same issue. It should now be fixed; the only caveat is that you now need to supply the path to your stanza installation manually. That means you might need to do something like:
language: en
pipeline:
- name: rasa_nlu_examples.tokenizers.StanzaTokenizer
  lang: "en"
  cache_dir: "tests/data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
If you are running this locally, the stanza files are usually located in ~/stanza_resources/. Also, just to check, which platform are you using? Windows/MacOS/Ubuntu?
Ubuntu :)
Just to make sure I did everything accordingly, I'll describe the testing conditions here. I used the stanzatokenizer.py from here.
The stanza_test folder looks something like this:
|-PPA_stanza_test/
| | stanzatokenizer.py
| | config.yml
| | domain.yml
| | endpoints.yml
| | credentials.yml
| | actions.py
| |-tests/
| |-results/
| |-data/
| | |-stanza/
| | | |-et/
| | | | |-depparse
| | | | |-pos
| | | | |-tokenize
| | | | |-lemma
| | | | |-pretrain
| |-models/
then with config.yml
like this:
language: et
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
My WhitespaceTokenizer is in another folder and its config.yml looks like this:
language: "et"
pipeline:
- name: WhitespaceTokenizer
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"
I trained new models first and then tested them. I tried to create some harder user messages that won't result in high confidence, to better review the confidence differences. Here are the results (for the sake of brevity, I excluded the messages and intent names themselves):
Ideal case | WhiteSpaceTokenizer (WST) | StanzaTokenizer (ST) | Which one is more confident |
---|---|---|---|
Received user message 'abc' with intent '{'name': 'X', 'confidence':1.0} | Received user message 'abc' with intent '{'name': 'X', 'confidence': 0.98} | Received user message 'abc' with intent '{'name': 'Y', 'confidence': 0.136} | WST got it right and is sure of it, stanza got it wrong and is not sure of it. |
Received user message 'abcd' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcd' with intent '{'name': 'Y', 'confidence': 0.656}' | Received user message 'abcd' with intent '{'name': 'X', 'confidence': 0.148} | WST is quite confident in a wrong intent, while ST is not confident at all in the right intent. |
Received user message 'abcde' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcde' with intent '{'name': 'Y', 'confidence': 0.367}' | Received user message ''abcde' with intent '{'name': 'Z', 'confidence': 0.122}' | Seems to be a bad example- both configs got it wrong. ST is a bit less confident, which is good. |
Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdef' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and confident about it, ST is wrong and not confident about it |
Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdefg' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and confident about it, ST is wrong and not confident about it |
Received user message 'a1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a1' with intent '{'name': 'X', 'confidence': 0.868}' | Received user message 'a1' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and quite confident about it, ST is wrong and not confident about it |
Received user message 'a12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a12' with intent '{'name': 'X', 'confidence': 0.827}' | Received user message 'a12' with intent '{'name': 'Y', 'confidence': 0.136} | WST is right and quite confident about it, ST is wrong and not confident about it |
Received user message 'a123' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a123' with intent '{'name': 'Y', 'confidence': 0.83}' | Received user message 'a123' with intent '{'name': 'Z', 'confidence': 0.12} | WST is wrong and quite confident about it, stanza is wrong and not at all confident about it. |
Received user message 'b1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b1' with intent '{'name': 'Y', 'confidence': 0.666}' | Received user message 'b1' with intent '{'name': 'Z', 'confidence': 0.115} | WST is wrong and a bit confident about it, stanza is wrong and not at all confident about it. |
Received user message 'b12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b12' with intent '{'name': 'X', 'confidence': 0.998}' | Received user message 'b12' with intent '{'name': 'Z', 'confidence': 0.137}' | WST is right and confident about it, ST is wrong and not confident about it |
As you can see, the pipeline with the StanzaTokenizer is not confident about anything and usually gets the intent wrong. The WST, however, usually gets things right, but when it's wrong, it's wrong quite confidently.
The dataset I am working on has ca 32 intents, some of which have very similar keywords or situations (à la creating a passport application and receiving a passport).
@Lindafr this is very elaborate. Thanks for sharing!
I may have found an issue with your setup though.
language: et
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 1
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
Notice how DIET is only using 1 epoch? This is probably why the setup is underperforming. Also, you've removed the CountVectorsFeaturizer, which we need to get the lemma property. It's an undocumented feature, but the CountVectorsFeaturizer will grab the lemma if it is available on the token.
Could you try running this?
language: et
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
If you're interested in summary statistics, you can also run two config files and compare the results. Here's a snippet that might help:
rasa test nlu --config configs/config-light.yml \
--cross-validation --runs 1 --folds 2 \
--out gridresults/config-light
rasa test nlu --config configs/config-heavy.yml \
--cross-validation --runs 1 --folds 2 \
--out gridresults/config-heavy
This will grab two config files (in this case config-light.yml and config-heavy.yml) and save the summary statistics in the gridresults/config-light and gridresults/config-heavy folders. You might enjoy using rasalit for this.
Hi, @koaning ,
Based on the results I suspected some kind of mistake, which is why I wrote about the test conditions (I copy-pasted your example and didn't think it through). Today I'll run it again and then we'll have some real results. It might take a bit of time, because I first have to do some other Rasa-related things that are due tomorrow.
@Lindafr no rush! I'm just super curious of the results. 😄
Renewed results are as follows (with the config proposed earlier):
Ideal case | WhiteSpaceTokenizer (WST) | StanzaTokenizer (ST) | Which one is more confident |
---|---|---|---|
Received user message 'abc' with intent '{'name': 'X', 'confidence':1.0} | Received user message 'abc' with intent '{'name': 'X', 'confidence': 0.98} | Received user message 'abc' with intent '{'name': 'Y', 'confidence': 0.95} | WST got it right and is sure of it, stanza got it wrong but very sure of it. |
Received user message 'abcd' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcd' with intent '{'name': 'Y', 'confidence': 0.656}' | Received user message 'abcd' with intent '{'name': 'X', 'confidence': 0.955}' | WST is quite confident in a wrong intent, while ST is confident in the right intent. |
Received user message 'abcde' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcde' with intent '{'name': 'Y', 'confidence': 0.367}' | Received user message 'abcde' with intent '{'name': 'X', 'confidence': 0.675}' | WST got it wrong. ST is not too confident in the right intent. |
Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdef' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdef' with intent '{'name': 'Y', 'confidence': 0.13} | WST is right and confident about it, ST is wrong and not confident about it |
Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 0.97}' | Received user message 'abcdefg' with intent '{'name': 'X', 'confidence': 0.999}' | WST is right and confident about it, ST also right and very confident about it |
Received user message 'a1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a1' with intent '{'name': 'X', 'confidence': 0.868}' | Received user message 'a1' with intent '{'name': 'X', 'confidence': 0.895}' | Both are right and quite confident about it |
Received user message 'a12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a12' with intent '{'name': 'X', 'confidence': 0.827}' | Received user message 'a12' with intent '{'name': 'X', 'confidence': 0.999}' | Both are right and quite confident about it, WST is less confident |
Received user message 'a123' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'a123' with intent '{'name': 'Y', 'confidence': 0.83}' | Received user message 'a123' with intent '{'name': 'Z', 'confidence': 0.36} | WST is wrong and a bit confident about it, stanza is wrong in another way and a bit less confident about it. |
Received user message 'b1' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b1' with intent '{'name': 'Y', 'confidence': 0.666}' | Received user message 'b1' with intent '{'name': 'X', 'confidence': 0.678}' | WST is wrong and a bit confident about it, stanza is right and quite confident about it. |
Received user message 'b12' with intent '{'name': 'X', 'confidence': 1.0}' | Received user message 'b12' with intent '{'name': 'X', 'confidence': 0.998}' | Received user message 'b12' with intent '{'name': 'X', 'confidence': 0.936}' | WST is right and confident about it, ST is right and less confident about it |
As you can see, ST got only 3 utterances wrong, while WST got 4 wrong. Overall, if you exclude the first example, WST is more confident in its wrong intents, while stanza is not confident when it's wrong. Stanza seems to be better on this manual check.
Now for the other statistics.
2020-09-09 14:26:20 INFO rasa.test - Intent evaluation results
2020-09-09 14:26:20 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 14:26:20 INFO rasa.nlu.test - test Accuracy: 0.619 (0.023)
2020-09-09 14:26:20 INFO rasa.nlu.test - test F1-score: 0.612 (0.028)
2020-09-09 14:26:20 INFO rasa.nlu.test - test Precision: 0.645 (0.042)
2020-09-09 14:43:59 INFO rasa.test - Intent evaluation results
2020-09-09 14:43:59 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 14:43:59 INFO rasa.nlu.test - test Accuracy: 0.596 (0.016)
2020-09-09 14:43:59 INFO rasa.nlu.test - test F1-score: 0.589 (0.009)
2020-09-09 14:43:59 INFO rasa.nlu.test - test Precision: 0.615 (0.009)
One can see that WST's test accuracy, F1, and precision are actually slightly better. A bit surprising, as I thought the extra POS information would help.
With config.yml (the difference between WST and this is the first element in the pipeline):
language: "et"
pipeline:
- name: stanzatokenizer.StanzaTokenizer
  lang: "et"
  cache_dir: "data/stanza"
- name: CountVectorsFeaturizer
  OOV_token: oov.txt
  token_pattern: (?u)\b\w+\b
- name: CountVectorsFeaturizer
  analyzer: char_wb
  min_ngram: 1
  max_ngram: 4
- name: LexicalSyntacticFeaturizer
  "features": [
    ["low", "title", "upper"],
    ["BOS", "EOS", "low", "upper", "title", "digit", "pos"],
    ["low", "title", "upper"],
  ]
- name: DIETClassifier
  epochs: 200
policies:
- name: MemoizationPolicy
  max_history: 5
- name: TEDPolicy
  epochs: 100
- name: TwoStageFallbackPolicy
  nlu_threshold: 0.8
  core_threshold: 0.8
  fallback_core_action_name: "action_default_fallback"
  fallback_nlu_action_name: "action_default_fallback"
  deny_suggestion_intent_name: "out_of_scope"
The results are slightly better than the recommended ST config, but still below WST:
2020-09-09 15:01:52 INFO rasa.test - Intent evaluation results
2020-09-09 15:01:52 INFO rasa.nlu.test - train Accuracy: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - train F1-score: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - train Precision: 1.000 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - test Accuracy: 0.593 (0.003)
2020-09-09 15:01:52 INFO rasa.nlu.test - test F1-score: 0.593 (0.000)
2020-09-09 15:01:52 INFO rasa.nlu.test - test Precision: 0.631 (0.002)
Interesting. Looking at these results I wonder about another issue. The difference between the train/test results in both the Stanza and Whitespace scenarios is huge. Especially because the training accuracy is 100%, I think we're overfitting in both examples here.
If you're up for it, could you try tuning the epochs down to maybe 50 for DIET? If you still see a big difference between train/test feel free to tune it down even further to 25. The goal is to manually stop the algorithm before it starts overfitting.
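The idea of stopping before the model overfits amounts to picking the epoch where held-out performance peaks instead of training until the train score saturates. A toy sketch with invented accuracy curves (the numbers below are made up for illustration):

```python
def best_epoch(history):
    """Pick the epoch with the highest test accuracy.

    `history` is a list of (epoch, train_acc, test_acc) tuples.
    """
    return max(history, key=lambda row: row[2])[0]

# Hypothetical sweep: train accuracy keeps climbing with more epochs,
# but test accuracy peaks and then declines as the model overfits.
history = [(25, 0.65, 0.33), (50, 0.80, 0.50), (100, 0.98, 0.54), (200, 1.00, 0.52)]
# best_epoch(history) → 100
```

In practice you would run the epoch sweep via `rasa test nlu --cross-validation` per config and compare the reported test scores.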
@Lindafr is your dataset publicly available? I might also be able to run some benchmarks on your behalf if you're interested. The research team here would love to have an Estonian dataset to benchmark our algorithms on.
Some more statistics on basically the same config.yml files.
(config.yml is as in the Extra here.)
with DIET epoch 100
2020-09-10 09:31:00 INFO rasa.test - Intent evaluation results
2020-09-10 09:31:00 INFO rasa.nlu.test - train Accuracy: 0.984 (0.005)
2020-09-10 09:31:00 INFO rasa.nlu.test - train F1-score: 0.983 (0.006)
2020-09-10 09:31:00 INFO rasa.nlu.test - train Precision: 0.984 (0.007)
2020-09-10 09:31:00 INFO rasa.nlu.test - test Accuracy: 0.536 (0.023)
2020-09-10 09:31:00 INFO rasa.nlu.test - test F1-score: 0.508 (0.022)
2020-09-10 09:31:00 INFO rasa.nlu.test - test Precision: 0.570 (0.018)
with DIET epoch 50
2020-09-10 09:10:10 INFO rasa.test - Intent evaluation results
2020-09-10 09:10:10 INFO rasa.nlu.test - train Accuracy: 0.803 (0.052)
2020-09-10 09:10:10 INFO rasa.nlu.test - train F1-score: 0.775 (0.055)
2020-09-10 09:10:10 INFO rasa.nlu.test - train Precision: 0.835 (0.009)
2020-09-10 09:10:10 INFO rasa.nlu.test - test Accuracy: 0.500 (0.034)
2020-09-10 09:10:10 INFO rasa.nlu.test - test F1-score: 0.454 (0.041)
2020-09-10 09:10:10 INFO rasa.nlu.test - test Precision: 0.570 (0.021)
with DIET epoch 25
2020-09-10 09:14:20 INFO rasa.test - Intent evaluation results
2020-09-10 09:14:20 INFO rasa.nlu.test - train Accuracy: 0.650 (0.060)
2020-09-10 09:14:20 INFO rasa.nlu.test - train F1-score: 0.614 (0.066)
2020-09-10 09:14:20 INFO rasa.nlu.test - train Precision: 0.725 (0.071)
2020-09-10 09:14:20 INFO rasa.nlu.test - test Accuracy: 0.329 (0.003)
2020-09-10 09:14:20 INFO rasa.nlu.test - test F1-score: 0.295 (0.021)
2020-09-10 09:14:20 INFO rasa.nlu.test - test Precision: 0.367 (0.065)
with DIET epoch 15
2020-09-10 09:17:48 INFO rasa.test - Intent evaluation results
2020-09-10 09:17:48 INFO rasa.nlu.test - train Accuracy: 0.482 (0.109)
2020-09-10 09:17:48 INFO rasa.nlu.test - train F1-score: 0.447 (0.123)
2020-09-10 09:17:48 INFO rasa.nlu.test - train Precision: 0.578 (0.104)
2020-09-10 09:17:48 INFO rasa.nlu.test - test Accuracy: 0.298 (0.060)
2020-09-10 09:17:48 INFO rasa.nlu.test - test F1-score: 0.238 (0.072)
2020-09-10 09:17:48 INFO rasa.nlu.test - test Precision: 0.314 (0.057)
(config.yml is as described here.)
with DIET epoch 100
2020-09-10 09:38:18 INFO rasa.test - Intent evaluation results
2020-09-10 09:38:18 INFO rasa.nlu.test - train Accuracy: 0.992 (0.008)
2020-09-10 09:38:18 INFO rasa.nlu.test - train F1-score: 0.992 (0.008)
2020-09-10 09:38:18 INFO rasa.nlu.test - train Precision: 0.993 (0.007)
2020-09-10 09:38:18 INFO rasa.nlu.test - test Accuracy: 0.562 (0.018)
2020-09-10 09:38:18 INFO rasa.nlu.test - test F1-score: 0.545 (0.020)
2020-09-10 09:38:18 INFO rasa.nlu.test - test Precision: 0.570 (0.013)
with DIET epoch 50
2020-09-10 10:11:53 INFO rasa.test - Intent evaluation results
2020-09-10 10:11:53 INFO rasa.nlu.test - train Accuracy: 0.889 (0.013)
2020-09-10 10:11:53 INFO rasa.nlu.test - train F1-score: 0.868 (0.026)
2020-09-10 10:11:53 INFO rasa.nlu.test - train Precision: 0.878 (0.046)
2020-09-10 10:11:53 INFO rasa.nlu.test - test Accuracy: 0.487 (0.016)
2020-09-10 10:11:53 INFO rasa.nlu.test - test F1-score: 0.443 (0.012)
2020-09-10 10:11:53 INFO rasa.nlu.test - test Precision: 0.491 (0.047)
with DIET epoch 25
2020-09-10 10:13:59 INFO rasa.test - Intent evaluation results
2020-09-10 10:13:59 INFO rasa.nlu.test - train Accuracy: 0.754 (0.018)
2020-09-10 10:13:59 INFO rasa.nlu.test - train F1-score: 0.733 (0.024)
2020-09-10 10:13:59 INFO rasa.nlu.test - train Precision: 0.849 (0.028)
2020-09-10 10:13:59 INFO rasa.nlu.test - test Accuracy: 0.412 (0.013)
2020-09-10 10:13:59 INFO rasa.nlu.test - test F1-score: 0.357 (0.011)
2020-09-10 10:13:59 INFO rasa.nlu.test - test Precision: 0.452 (0.027)
with DIET epoch 15
2020-09-10 10:17:19 INFO rasa.test - Intent evaluation results
2020-09-10 10:17:19 INFO rasa.nlu.test - train Accuracy: 0.490 (0.085)
2020-09-10 10:17:19 INFO rasa.nlu.test - train F1-score: 0.479 (0.052)
2020-09-10 10:17:19 INFO rasa.nlu.test - train Precision: 0.654 (0.042)
2020-09-10 10:17:19 INFO rasa.nlu.test - test Accuracy: 0.238 (0.098)
2020-09-10 10:17:19 INFO rasa.nlu.test - test F1-score: 0.205 (0.099)
2020-09-10 10:17:19 INFO rasa.nlu.test - test Precision: 0.262 (0.107)
The overfitting is still quite large. I think it's partly due to my still very small dataset: every intent (32 in total) has only 10-20 example questions, so there's not much to learn from. I have plans to make it larger with test users and by hand, but right now that's all I've got.
@koaning , the dataset is not publicly available and I cannot share it.
Do you think part of the overfitting can be due to my small dataset? I can try again later when I have more examples, but right now the results are as they are, and adding POS doesn't seem to bring any improvement (?).
@Lindafr yeah those last results seem convincing. I'll add the feature just in case so that other folks can try it out but it seems safe to say that on your use-case right now it doesn't contribute anything substantial.
Thanks a lot for the feedback though 👍! It's been very(!) helpful.
I'll merge the stanza feature today. It might prove useful to other folks and you can also check if it boosts performance once you've got more data. I'll close this issue once the feature is merged.
One last question out of curiosity, did you try the pretrained embeddings (BytePair/FastText)? If so, I'd love to hear if they were helpful.
https://github.com/RasaHQ/rasa-nlu-examples/pull/30 has been merged! Closing this issue but we can continue/pick up the conversation in this issue if there's a related discussion to be continued.
Just wanted to mention it here. Full stanza support is now coming to spaCy https://explosion.ai/blog/spacy-v3-nightly.
@koaning , are you sure? They don't mention it anywhere; they only compare Stanza's NER results with theirs in one table. Plus, they do have the spacy-stanza package, which we tried but failed to integrate with Rasa.
Btw, is there any way to use stanzatokenizer.py for getting POS and lemma and also use SpacyNLP for getting the FastText embedding? There can't be two tokenizers in the same pipeline, and SpacyNLP + stanzatokenizer.StanzaTokenizer doesn't work. I guess not, but I'm asking anyway just in case there is some obvious solution I didn't think of.
I recall reading elsewhere that a tighter integration is now possible, but I can't find the link anymore ... will look again!
If you want FastText embeddings, can't you use the ones that are available directly in this repository?
In places where FastText wrapped into spaCy is of no use, Stanza comes in handy: it can give us the necessary POS tags and lemmatisation. That is at least the case for Estonian, and should also be true for Finnish, Hebrew, Hindi, Hungarian, Indonesian, Irish, Korean, Latvian, Persian, Telugu, Urdu, etc.