deeppavlov / DeepPavlov

An open source library for deep learning end-to-end dialog systems and chatbots.
https://deeppavlov.ai
Apache License 2.0

How to get "none of these classes" output for cnn_model classifier? #555

Closed takiholadi closed 5 years ago

takiholadi commented 6 years ago

I am building a text classifier using cnn_model from DeepPavlov.

Let's say I have three categories [order_pizza, order_burger, order_pepsi]. When the model makes a prediction, it always distributes a total probability of 1.0 among these classes, so even trash messages get high scores.

What is the right way to get a "none of these classes" output? Should I add an "other" class during training, or apply a threshold somehow?

dilyararimovna commented 6 years ago

Hello! Such a task is multi-label classification (any number of labels, including zero). To set it up, please provide binary_crossentropy as the loss function, sigmoid as the last-layer activation, and a confident_threshold value for the proba2labels component (labels with probability higher than the threshold will be chosen).
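
Roughly, the thresholding step then behaves like this (a toy Python sketch with made-up scores, not DeepPavlov's actual proba2labels implementation):

    import numpy as np

    # Hypothetical sigmoid outputs for [order_pizza, order_burger, order_pepsi].
    # With binary_crossentropy + sigmoid the per-class scores are independent
    # and no longer have to sum to 1.0.
    probas = np.array([0.12, 0.08, 0.31])
    confident_threshold = 0.5

    # Keep every class whose probability exceeds the threshold; a trash
    # message like this one yields an empty list: "none of these classes".
    labels = [i for i, p in enumerate(probas) if p > confident_threshold]
    print(labels)  # []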

takiholadi commented 5 years ago

@dilyararimovna, it's good news for me that this can be turned into multi-label classification! Maybe it would be better to provide such a how-to in the documentation section "Classification models in DeepPavlov".

Could you please clarify how to set it up, based on rusentiment_cnn.json? Here is the relevant part of the config:

        "filters_cnn": 256,
        "optimizer": "Adam",
        "learning_rate": 0.01,
        "learning_rate_decay": 0.1,
        "loss": "binary_crossentropy",
        "text_size": 40,
        "last_layer_activation": "sigmoid",
        "coef_reg_cnn": 1e-3,
        "coef_reg_den": 1e-2,
        "dropout_rate": 0.5,
        "dense_size": 100,
        "model_name": "cnn_model"
      },
      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "name": "proba2labels",
        "confident_threshold": 0.5,
        "max_proba": true
      },

I am not sure about confident_threshold/proba2labels/max_proba. Have I set them correctly?

dilyararimovna commented 5 years ago

Hello! The parameters for keras_classification_model are correct, but for proba2labels you should give only one of the alternative parameters, e.g. confident_threshold.
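
The two parameters select mutually exclusive behaviors, which a toy sketch (made-up scores, not the actual component) makes clear:

    import numpy as np

    probas = np.array([0.31, 0.44, 0.12])

    # max_proba mode: always returns exactly one label (the argmax),
    # so an empty "none of these classes" answer is impossible.
    labels_argmax = [int(np.argmax(probas))]                         # [1]

    # confident_threshold mode: returns all labels above the threshold,
    # which may be zero labels.
    labels_threshold = [i for i, p in enumerate(probas) if p > 0.5]  # []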

takiholadi commented 5 years ago

I will use

      {
        "in": "y_pred_probas",
        "out": "y_pred_ids",
        "name": "proba2labels",
        "confident_threshold": 0.5
      },

Thank you!

takiholadi commented 5 years ago

@dilyararimovna I am stuck on this:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

How can I make my data readable for the multi-label case?

I use a JSON config with basic_classification_reader. It reads CSV files where the "y" column is the intent. I tried to use lists in each "y" cell, but did not succeed.
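
For reference, the representation sklearn asks for in that error can be produced from a list of label lists with MultiLabelBinarizer (a standalone sketch, independent of basic_classification_reader):

    from sklearn.preprocessing import MultiLabelBinarizer

    # A "sequence of sequences": one list of labels per sample, possibly empty.
    y = [["order_pizza"], ["order_pizza", "order_pepsi"], []]

    mlb = MultiLabelBinarizer()
    y_indicator = mlb.fit_transform(y)
    print(mlb.classes_)  # ['order_pepsi' 'order_pizza']
    print(y_indicator)
    # [[0 1]
    #  [1 1]
    #  [0 0]]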

dilyararimovna commented 5 years ago

The documentation contains the section "Train on other datasets". There is an example there of how the dataset can look for multi-label classification.

takiholadi commented 5 years ago

> The documentation contains the section "Train on other datasets". There is an example there of how the dataset can look for multi-label classification.

Yes, my dataset is correctly prepared.

I can run classification with DeepPavlov 0.0.8 using the old config. But DeepPavlov 0.0.9 with the new-style config still gives me the same ValueError:


ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 train_evaluate_model_from_config(INTENT_CONFIG_PATH)

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/deeppavlov/core/commands/train.py in train_evaluate_model_from_config(config, iterator, to_train, to_validate)
    201 
    202     if callable(getattr(model, 'train_on_batch', None)):
--> 203         _train_batches(model, iterator, train_config, metrics_functions)
    204     elif callable(getattr(model, 'fit_batches', None)):
    205         _fit_batches(model, iterator, train_config)

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/deeppavlov/core/commands/train.py in _train_batches(model, iterator, train_config, metrics_functions)
    348     if train_config['val_every_n_epochs'] > 0 and epochs % train_config['val_every_n_epochs'] == 0:
    349         report = _test_model(model, metrics_functions, iterator,
--> 350                              train_config['batch_size'], 'valid', start_time, train_config['show_examples'])
    351         report['epochs_done'] = epochs
    352         report['batches_seen'] = i

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/deeppavlov/core/commands/train.py in _test_model(model, metrics_functions, iterator, batch_size, data_type, start_time, show_examples)
    266         out += list(val)
    267 
--> 268     metrics = [(m.name, m.fn(*[outputs[i] for i in m.inputs])) for m in metrics_functions]
    269 
    270     report = {

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/deeppavlov/core/commands/train.py in <listcomp>(.0)
    266         out += list(val)
    267 
--> 268     metrics = [(m.name, m.fn(*[outputs[i] for i in m.inputs])) for m in metrics_functions]
    269 
    270     report = {

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/deeppavlov/metrics/fmeasure.py in round_f1_macro(y_true, y_predicted)
     74     predictions = y_predicted
     75 
---> 76     return f1_score(np.array(y_true), np.array(predictions), average="macro")
     77 
     78 

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/sklearn/metrics/classification.py in f1_score(y_true, y_pred, labels, pos_label, average, sample_weight)
    712     return fbeta_score(y_true, y_pred, 1, labels=labels,
    713                        pos_label=pos_label, average=average,
--> 714                        sample_weight=sample_weight)
    715 
    716 

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/sklearn/metrics/classification.py in fbeta_score(y_true, y_pred, beta, labels, pos_label, average, sample_weight)
    826                                  average=average,
    827                                  warn_for=('f-score',),
--> 828                                  sample_weight=sample_weight)
    829     return f
    830 

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/sklearn/metrics/classification.py in precision_recall_fscore_support(y_true, y_pred, beta, labels, pos_label, average, warn_for, sample_weight)
   1023         raise ValueError("beta should be >0 in the F-beta score")
   1024 
-> 1025     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
   1026     present_labels = unique_labels(y_true, y_pred)
   1027 

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/sklearn/metrics/classification.py in _check_targets(y_true, y_pred)
     71     check_consistent_length(y_true, y_pred)
     72     type_true = type_of_target(y_true)
---> 73     type_pred = type_of_target(y_pred)
     74 
     75     y_type = set([type_true, type_pred])

~/projects/intents_classifier_deeppavlov10/env/lib/python3.6/site-packages/sklearn/utils/multiclass.py in type_of_target(y)
    261     if (not hasattr(y[0], '__array__') and isinstance(y[0], Sequence)
    262             and not isinstance(y[0], string_types)):
--> 263         raise ValueError('You appear to be using a legacy multi-label data'
    264                          ' representation. Sequence of sequences are no'
    265                          ' longer supported; use a binary array or sparse'

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

dilyararimovna commented 5 years ago

Hello! This f1_score error is caused by passing lists of labels as its input, while sklearn requires label-indicator arrays. So, in config["train"]["metrics"] choose the f1_macro metric and pass it one-hot representations of the true and predicted labels:

"metrics": [
      {
        "name": "sets_accuracy",
        "inputs": [
          "y",
          "y_pred_labels"
        ]
      },
      {
        "name": "f1_macro",
        "inputs": [
          "y_onehot",
          "y_pred_onehot"
        ]
      },
      {
        "name": "roc_auc",
        "inputs": [
          "y_onehot",
          "y_pred_probas"
        ]
      }
    ]
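
A quick check of the difference (toy one-hot arrays; f1_score is sklearn's):

    import numpy as np
    from sklearn.metrics import f1_score

    # Label-indicator (one-hot) arrays: rows are samples, columns are classes.
    y_true_onehot = np.array([[1, 0, 0], [0, 1, 1], [0, 1, 0]])
    y_pred_onehot = np.array([[1, 0, 0], [0, 1, 1], [0, 1, 0]])

    # Accepted: binary indicator arrays are the supported multi-label format.
    print(f1_score(y_true_onehot, y_pred_onehot, average="macro"))  # 1.0

    # Rejected with the ValueError above: a ragged list of label lists, e.g.
    # f1_score([["a"], ["b", "c"]], [["a"], ["b"]], average="macro")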

While trying to reproduce this error on DSTC 2, we found another bug. Thank you! For your information, the bug is the following: if you have labels that do not appear in the train set but do appear in the valid or test sets, you have to provide an unknown token for the classes vocabulary, so that the model can assign an unknown class to previously unseen labels. For example:

{
        "id": "classes_vocab",
        "name": "simple_vocab",
        "fit_on": [
          "y"
        ],
        "save_path": "classifiers/intents_dstc2_v6/classes.dict",
        "load_path": "classifiers/intents_dstc2_v6/classes.dict",
        "in": "y",
        "out": "y_ids",
        "special_tokens": ["<UNK>"]
      }
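
As a toy illustration of why the unknown token matters (not the actual simple_vocab implementation):

    # Classes seen in the train set, plus a reserved unknown token.
    classes = ["<UNK>", "order_pizza", "order_burger", "order_pepsi"]
    label2id = {label: i for i, label in enumerate(classes)}

    def encode(label: str) -> int:
        # A label that never appeared in the train set falls back to <UNK>
        # instead of raising a KeyError during validation or testing.
        return label2id.get(label, label2id["<UNK>"])

    print(encode("order_pizza"))  # 1
    print(encode("order_sushi"))  # 0 -> mapped to <UNK>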