tensorflow intent classification is very sensititive! What to do?

ctrado18 commented 6 years ago

Hey,

I use for german language the tensorflow embedding.

I recognize a high sensitivity of the intent classifictaion with just slightly changes of sentences like adding just one whitespace between words!
trained in german sentences with just maskulin adjective forms. Testing on feminine adjective forms, which in this case is just removing one letter at the end of the word, changes intent drastically!
My intent greet is just words like hi, hey... Adding hi or hey to one of a well trained sentence, so hi+sentence gives intent greet!

So, I am thinking I have to play with params for my specific use cases to overcome such sensitivity. Can you give me some experience on which params I have to focus?

amn41 commented 6 years ago

for all of there problems It definitely looks like you are overfitting. How many epochs are you training for? do you have a separate test set?

for 3. you could try creating a multi-intent

Ghostvv commented 6 years ago

@ctrado18 could you please add a bit more details:

what do you mean by just one whitespace between words: there was a whitespace and you add the second (that should not change anything) or there was a long German word and you split it into two (in this case you create completely new words that might even be out-of-vocabulary because the classifier by default uses whitespace tokenizer, so it ignores them during prediction)?
Removing one letter at the end of the word creates completely different word from the point of view of this classifier! So no wonder it performs differently.

In order to solve it, add such training data or consider to use some lemmatizer for preprocessing.

ctrado18 commented 6 years ago

@amn41 Yes, I suppossed so too. This results from the sklearn intent classifier preveously. Now you don't need anymore so much training data, right?

Multi-Inent: Yes, but this is confusing. When I need an intent just for ProductA (just for imagination...) like Costs_ProductA_Scenario1, Costs_ProductB_Scenario2... I need also just the intent Costs but I have no and need no intent for ProductA

Because at one post you said you need to have all subintent separately!

But I don't need to have all subintent..Maybe I want just at one level a distinction but not all all intent levels... Do I need the tokenizer_whitespace? And what is it actually. Maybe that's my issue with whitespaces?

This is my pipleine for version 12.3.

pipeline:
  - name: "nlp_spacy"
  - name: "tokenizer_spacy"
  - name: "ner_crf"
  - name: "ner_synonyms"
  - name: "intent_featurizer_count_vectors"
  - name: "intent_classifier_tensorflow_embedding"
    intent_tokenization_flag: true
    intent_split_symbol: "_"

@Ghostvv Yes lemmatizer...But for german spacy is not really good! Is there anything for german? Also spacy distinguishes betwen upper and lower capital letters for words, but in chatbot you write in small letters typically....

@amn41 @Ghostvv I meant Nouns, not adjectives I think... In german when you remove a letter at the end you can get for some Nouns a feminine instead of a maskulin form...But the base form is the same, so it should not have such a great impact?! That is why we are suing this intent embedding approach, to make it less sensitive to short cuts or typos eg...

akelad commented 6 years ago

how much training data do you have? number of intents, examples per intent etc

Ghostvv commented 6 years ago

@ctrado18 thank you for detailed answer. For multi-intent, strictly speaking you do not need separate data for all subintent, I think it is worth to try as you described without subintent for Product and check the performance. If you use spacy, you do not need tokenizer.

I know that German language have such structure, but intent_classifier_tensorflow_embedding doesn't know it, it builds vocabulary from words you provide, so if even one character is different, it will treat it as different word. If you want to deal with different endings, you could try to create preprocessor that cuts the ending of all words or smth like this. Unfortunately, I do not have any suggestions for you about good lemmatizers for German language. Please note that if you provide spacy, then intent_featurizer_count_vectors uses lemma_ from spacy as tokens. You'll need to overwrite this if you want to use custom preprocessor

ctrado18 commented 6 years ago

I would say I have 500 examples for the one intent and for my other specific intents just 30 examples. I think imbalances are handled well with tensorflow? For the 500 examples I created with some sentence structures with Chatito many exmaples in the way that I just plugged in many entitiy values for alle these sentences. So, this is why I ended up with so many examples. But maybe with tensorflow this not a good approach since the sentence is the same instead of one word?

Some questions the tensorflow algorithm: What happens if I set the intent_tokenization_flag to False? Or is there any differences in both ways how the algo classifies the intents? Afterwards I train my intents with _. Might this fix somehow the senitivity for mixing up intents like hey, I want to buy something without training it? I have the feeling that wenn you use intent tokenization, the algo is more sensitive when you mix up intent in one sentence...

Does the algo really create a vocabulary with only the trained words? So, the algo is really sensitiv to slightly changes of a word? I did'nt thought of that. I thought it uses similiarity measures that similiar words are really similiar...

ctrado18 commented 6 years ago

Please note that if you provide spacy, then intent_featurizer_countvectors uses lemma from spacy as tokens. You'll need to overwrite this if you want to use custom preprocessor

But you need spacy for NER_CRF? What if I want to try sthe standard intent_featurizer_count_vectors ? I am confused...

New words (also slight changes or abbreviation?) have no vector embedding? So for production, preprocssing is really neccessery? intent_sklearn was more robustr if i add a letter or something like that. Actually you have to run another classification algo as prep to find the similiarity between like Ich studiere and ich bin im studium? Otherwise you have to train all words on earth. 😄 Because I have an intent beruf.

But as @amn41 said in his first post, it seems that tensoflow should be able of handling these things like slightly changes. And my issue comes from overfitting!

I'm confused now maybe some expert can go grough my posts. 😄

Ghostvv commented 6 years ago

Does the algo really create a vocabulary with only the trained words? So, the algo is really sensitiv to slightly changes of a word? I did'nt thought of that. - It does precisely that!

So unless you have it in training data, the algorithm does not know a priori any similarity between Ich studiere and ich bin im studium before training. It learns its own embeddings only based on the provided training data.

If in your case spacy-sklearn pipeline performs better, why don't you use it?

about different pipelines, please read the docs: https://nlu.rasa.com/pipeline.html

ctrado18 commented 6 years ago

Thank you. Indeed, sklearn performs better but with lower confidence. I found out that the spacy model for german captures a lot of my specific vocab I think (I just tested a few words). Could you combine both embeddings? One question to spacy model: If a word is inside the model then, are there all other forms of that word (plural etc.) inside?

I recognized that both pipelines are still very sensitiv to white space such that if I plugg in an addditionaly space between two words of a sentence...Bot for my intents where I use round 40 examples, so overfitting should not be an issue?

@Ghostvv Still the pipeline question is open. SInce I use CRF I need somehow spacy.

Is there any difference of the inner working of the algo when I use multiple Intents (flag to true)?

Ghostvv commented 6 years ago

@ctrado18 Could you please explain your whitespace issue. If you already have a whitespace and you add the second one, it shouldn't make any difference. Could you please give an example?

ner_crf without spacy is introduced in the last master

ctrado18 commented 6 years ago

@Ghostvv Is the master a different version? I like to stick first to 12.3. What is tokenizer_whitespace and the difference to tokenizer_spacy and nlp_spacy? Is for ner_crf tokenizer_spacy enough?

Example: I need doctor I need doctor

Add one whitespace before doctor. For me this is sensitive and also changes intent !

But I recognize that it only changes the intent for my 3 somehow similiar intents. Those 3 are thought to hanlde hierarchical intents like where the difference is just the kind of employment. So: I am working and need doctor. versus I study and need doctor. There, plug in before or after the kind of employment changes intent...

DO you know the answer to my question to spacy above? That would help if I stick to intent_sklearn...

Ghostvv commented 6 years ago

yes, for ner_crf tokenizer_spacy is enough.

very strange behavior with whitespace, could you please add examples in German exactly as they are when you experience this problem?

ctrado18 commented 6 years ago

I tried to add more example and I think the whitespace issue is somehow more stable. But I will look at this.

What I still like to know if there is any difference of the algo when or when not using the Multiple Intent flag? Because, the trainig data is the same. I also could have some intents like Greet_Intent1_Intent2 just using normal intent classification and train all these intents. SO, where is the difference when I set the flag for multiple intents to true?! What advantages I have?

And I don't want to train those multiple intents. But I am a little bit sad about this mixing up intent when I use a greeting together with a normal sentence like in my starting post or here https://github.com/RasaHQ/rasa_nlu/issues/1182

I have an intent order, like I want to order something with right intent, but using hey, I want to order something it gets intent greet. So, I have to train those occurences or using multiple Intent, right? But is this a normal behaviour?

I just want to know I this is normal or I am doing wrong with my training data. But I think many have this "greet" issue.

It would be nice if @amn41 @akelad can have a look at. 😄

amn41 commented 6 years ago

@ctrado18 thanks for all the info you provided. @Ghostvv is the expert on the tensorflow pipeline. I think this discussion now contains too many ideas and questions for it to be fully resolved, so I'd propose we close it in favour of more specific issues.

If adding an extra whitespace between two words is really giving different results, that sounds like a bug. If you could create a minimal reproducible example that would be extremely helpful, please create a separate issue for that.

To your earlier comment: spacy_sklearn might be giving lower confidences, but that doesn't mean the model is worse. In fact very very high confidences are a strong signal that you are overfitting.

I see some other questions about the way the tensorflow pipeline works, which I'm sure other people would also be interested in. I think it would be very helpful if you could create a reproducible example of the models behaving differently with and without intent tokenization,

ctrado18 commented 6 years ago

Thanks. I will test more specifically my data. But some questions were specific I think. So, can I conlude there should be no difference with and without intent tokenization?

Still, i have a high sensitivity as I expained on examples before. I have just 5 intent, for the first 500 utterances or more and the other 4 just about 50 utterances. Have you experimented with the embedding to give some rule of thumbs for trainin data? Sensitiviy (change of the intent) when doing a mispelling or append just a single letter to a word is high. Is my training data too low? I think those issues are more a general case for building training data from scratch?

To understand the embedding algo correctly, you need to train actually a large dat set consisting of "every" word since there are many similiar words meaning the same like the words hoch, viel in wie viel kostet or wie hoch sind die Kosten... If you have never used the word hoch (in this context, but maybe in another context) this word will be neglected?

ctrado18 commented 6 years ago

One short question to the intent_spacy_sklearn classifier: What and where is this "spacy_doc"? But I assume that it just contains for every word the word vector coming from spacy. So the features for intent classification are just those word vectors?

I find fastText very interesting and try that too such hta I can compare all 3 methods.

Would be nice if you can go through my points briefly, then we can close.

amn41 commented 6 years ago

One thing you can also try is to change the analyzer in the CountVectorFeaturizer ( which wraps sklearn's CountVectorizer to use character ngrams instead of whole tokens

Ghostvv commented 6 years ago

for the documentation of spacy_doc please refer to spacy documentation: https://spacy.io/api/doc.

about sensitivity, yes it is the problem for embedding classifier, that is difficult to tackle. For German language lemmatization, I found this post https://datascience.blog.wzb.eu/2017/05/19/lemmatization-of-german-language-text/

ctrado18 commented 6 years ago

@amn41 Thanks. And I found out thatyou can leave out the intent label. That is my overfitting problem since I use many times same sentence with different entities to train ner_crf.

To compare intent classification for all 3 methods I want to look if some of my custom words are in those pre trained models. But what the heck is going on here? In spacy it seems every word is inside this model. I tried many words and garbage and the follwoing gives that this word `u'sdasfaf``is inside spacy and is ADJ:

import spacy

nlp = spacy.load('de_core_news_sm')
doc = nlp(u'sdasfaf')

for token in doc:
    print(token.text,token.lemma, token.lemma_, token.pos_)
    print("has_vector:", token.has_vector)

@Ghostvv But, for intent sklearn classification you just use the word vectors? It is not so clear from the code since there is just "spacy_doc". Thanks, I look at this lemmatizer.

Ghostvv commented 6 years ago

yes, the standard pipeline for intent sklearn classification uses just the word vector from spacy doc

ctrado18 commented 6 years ago

@Ghostv thanks. Any idea about my code for checking if it has a vector? Why does garbage yield true?

I just saw in spacy's repo that there are many discussion with issues onto german lemmatizer stuff. I don't know why spacy messes so up for german. look at that:


nlp = spacy.load('de_core_news_sm')
doc = nlp(u'was kostet stein zzcfkfjduhüüü')

for token in doc:
    print(token.text, token.lemma_, token.vector_norm, token.is_oov)
    print("has_vector:", token.has_vector)

gives:

was wer 43.980087 True
has_vector: True
kostet kosten 45.899117 True
has_vector: True
stein stein 44.52947 True
has_vector: True
zzcfkfjduhüüü zzcfkfjduhüüü 39.83339 True
has_vector: True

It gives a norm and says it has vector but also that it is OOV!! So none of the words are in the model, and all are at the same time?!

Ghostvv commented 6 years ago

Interesting observations. I think the problem is in collecting training data and the complexity of German language. For the issue above please ask on spacy forums.

RasaHQ / rasa

tensorflow intent classification is very sensititive! What to do? #1181