RasaHQ / rasa

💬 Open source machine learning framework to automate text- and voice-based conversations: NLU, dialogue management, connect to Slack, Facebook, and more - Create chatbots and voice assistants
https://rasa.com/docs/rasa/
Apache License 2.0

Intent classification error: asks for at least 2 classes in the data when there are 2 classes already #359

Closed — shuvayan closed this issue 7 years ago

shuvayan commented 7 years ago

Hello,

I am using the code below to train my model:

from rasa_nlu.converters import load_data
from rasa_nlu.config import RasaNLUConfig
from rasa_nlu.model import Trainer

# Load the training examples and train the pipeline defined in config.json
training_data = load_data('./data/trainData.json')
trainer = Trainer(RasaNLUConfig("config.json"))
trainer.train(training_data)

And this is the config file I am using:

{
  "backend": "spacy_sklearn",
  "path": "./models",
  "data": "./data/trainData.json",
  "pipeline": ["nlp_spacy", "ner_spacy", "ner_synonyms", "intent_featurizer_spacy", "intent_featurizer_ngrams", "intent_classifier_sklearn"]
}
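For reference, the same training can also be run from the command line with this config file (this is the invocation used in the transcript further down in this thread):

python -m rasa_nlu.train -c config.json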

However, it throws an error saying:

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

Clearly this is not the case, as my training data has more than one intent class (buy and explore). A sample of the training data:

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "I want a shoe of black color and size 9",
        "intent": "buy",
        "entities": [
          {
            "start": 9,
            "end": 14,
            "value": "shoe ",
            "entity": "product"
          },
          {
            "start": 17,
            "end": 23,
            "value": "black ",
            "entity": "color"
          },
          {
            "start": 38,
            "end": 39,
            "value": "9",
            "entity": "size"
          }
        ]
      },
      {
        "text": "I want a shoe of brand adidas and size 10",
        "intent": "buy",
        "entities": [
          {
            "start": 9,
            "end": 14,
            "value": "shoe ",
            "entity": "product"
          },
          {
            "start": 23,
            "end": 30,
            "value": "adidas ",
            "entity": "brand"
          },
          {
            "start": 39,
            "end": 41,
            "value": "10",
            "entity": "size"
          }
        ]
      },

Can you please help with this?

tmbo commented 7 years ago

There should be statistics printed at the beginning when training starts. Can you post those results here? (They contain how many intents/entities there are and how many samples of each.)
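If nothing shows up under the stats line, something like this should give roughly the same numbers. This is just a sketch based on the training-file layout shown above, not the code rasa_nlu actually runs:

import json
from collections import Counter

# Count examples per intent and per entity type in the training file
# (path and JSON layout taken from the snippets earlier in this issue).
with open('./data/trainData.json') as f:
    data = json.load(f)

examples = data['rasa_nlu_data']['common_examples']
intent_counts = Counter(ex.get('intent') for ex in examples)
entity_counts = Counter(ent['entity']
                        for ex in examples
                        for ent in ex.get('entities', []))

print('examples per intent:', dict(intent_counts))
print('examples per entity type:', dict(entity_counts))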

shuvayan commented 7 years ago

This is the full transcript of the output:

C:\Users\shuvayan.das\Downloads\Chatbot_Python\entityRecognition\RASA_NLU>python -m rasa_nlu.train -c config.json
INFO:root:Trying to load spacy model with name 'en'
INFO:root:Added 'nlp_spacy' to component cache. Key 'nlp_spacy-en'.
INFO:root:Training data format at ./data/trainData.json is rasa_nlu
INFO:root:Training data stats:

INFO:root:Starting to train component nlp_spacy
INFO:root:Finished training component.
INFO:root:Starting to train component ner_spacy
INFO:root:Finished training component.
INFO:root:Starting to train component ner_synonyms
INFO:root:Finished training component.
INFO:root:Starting to train component intent_featurizer_spacy
INFO:root:Finished training component.
INFO:root:Starting to train component intent_featurizer_ngrams
C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\featurizers\ngram_featurizer.py:175: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  sentences = np.array(sentences)[mask]
C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\featurizers\ngram_featurizer.py:176: FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
  labels = np.array(labels)[mask]
Traceback (most recent call last):
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\train.py", line 83, in <module>
    do_train(config)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\train.py", line 73, in do_train
    interpreter = trainer.train(training_data)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\model.py", line 157, in train
    updates = component.train(*args)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\featurizers\ngram_featurizer.py", line 64, in train
    self.all_ngrams = self._get_best_ngrams(sentences, labels, spacy_nlp)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\featurizers\ngram_featurizer.py", line 127, in _get_best_ngrams
    return self._sort_applicable_ngrams(ngrams, sentences, labels, spacy_nlp)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\rasa_nlu\featurizers\ngram_featurizer.py", line 184, in _sort_applicable_ngrams
    clf.fit(X, y)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\linear_model\randomized_l1.py", line 112, in fit
    sample_fraction=self.sample_fraction, **params)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\memory.py", line 283, in __call__
    return self.func(*args, **kwargs)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\linear_model\randomized_l1.py", line 54, in _resample_model
    for _ in range(n_resampling)):
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py", line 326, in __init__
    self.results = batch()
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\externals\joblib\parallel.py", line 131, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\linear_model\randomized_l1.py", line 377, in _randomized_logistic
    clf.fit(X, y)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\linear_model\logistic.py", line 1186, in fit
    sample_weight=sample_weight)
  File "C:\Users\shuvayan.das\AppData\Local\Continuum\Anaconda3.3\lib\site-packages\sklearn\svm\base.py", line 875, in _fit_liblinear
    " class: %r" % classes_[0])
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

It seems the error is in the sklearn liblinear model part, and I have found this bug reported elsewhere: https://github.com/lensacom/sparkit-learn/issues/49

If this is the case, I believe random shuffling of the records has to be implemented before feeding the data to the models.
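A minimal sketch of the kind of shuffling I mean, based on the rasa_nlu_data/common_examples layout above (this is a pre-processing guess, not something rasa_nlu does itself; the output path is made up):

import json
import random

# Shuffle the training examples so the intents are interleaved
# instead of grouped together, then write the result to a new file.
random.seed(0)  # reproducible shuffle

with open('./data/trainData.json') as f:
    data = json.load(f)

random.shuffle(data['rasa_nlu_data']['common_examples'])

# hypothetical output path for the shuffled copy
with open('./data/trainData_shuffled.json', 'w') as f:
    json.dump(data, f, indent=2)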

tmbo commented 7 years ago

Right, so this is somewhat tricky. During training, the data gets shuffled and split into multiple cross-validation folds. It seems like the splitting creates partial data sets that contain only one of the classes. So you should add more examples, but we should also find a way around this (reduce the number of splits?).
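This is not the exact code path inside rasa_nlu, but a small scikit-learn sketch of the effect (labels and fold counts are made up): when one intent has very few examples, a plain split can leave a partition with a single class, which is exactly what liblinear rejects; stratified splitting or fewer splits avoids it.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy label set in the spirit of the issue: many "buy" examples,
# only two "explore" examples.
y = np.array(['buy'] * 8 + ['explore'] * 2)
X = np.random.RandomState(0).rand(len(y), 3)  # dummy features

# A plain KFold can produce a partition that contains only one class
# (here the last fold trains on "buy" examples only).
for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=5).split(X)):
    print('KFold fold {}: train classes = {}, test classes = {}'.format(
        fold, sorted(set(y[train_idx])), sorted(set(y[test_idx]))))

# StratifiedKFold keeps both intents in every partition, as long as
# each intent has at least n_splits examples.
for fold, (train_idx, test_idx) in enumerate(StratifiedKFold(n_splits=2).split(X, y)):
    print('StratifiedKFold fold {}: train classes = {}, test classes = {}'.format(
        fold, sorted(set(y[train_idx])), sorted(set(y[test_idx]))))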

shuvayan commented 7 years ago

Yes, adding more records will help, but I also saw somewhere that randomizing the records might help. I will try with more records and let you know the results.