deeppavlov / DeepPavlov

An open source library for deep learning end-to-end dialog systems and chatbots.
https://deeppavlov.ai
Apache License 2.0
6.72k stars 1.15k forks

Question: How to use the model for my own task? #158

Closed aCombray closed 6 years ago

aCombray commented 6 years ago

I want to use the model go_bot for my own task. In particular, the model is similar to configs/go_bot/gobot_dstc2.json. But I want to use my own dialogue data and slots. My question is what files do I need to provide? I understand I need to provide my own configuration in deeppavlov/configs/go_bot/, my own dataset_reader, my own tracker. But how to provide the data for the slot_filler? It would be great if you could specify what files are needed in which directory. Could anyone help me? Thanks.

mu-arkhipov commented 6 years ago

Hi there! The basic idea behind the slot_filler is entity extraction via a neural network whose output is chunks of text labeled with a defined category (entity type). These chunks are then passed to a fuzzy search based on Levenshtein distance. So there are essentially two phases of slot filling and two types of data needed. Below is a simple example.

Suppose there are two entity types: "food" and "location". Possible values of "food" are {"korean", "chinese", "russian traditional"}, and possible values of "location" are {"center", "suburbs"}.

1) Named Entity Recognition

For named entity recognition you have to provide the dataset in the following format:

Restaurant  O
in          O
the         O
center      B-LOC
of          O
the         O
city        O
serving     O
russian     B-FOOD
traditional I-FOOD
cuisine     O

Please refer to deeppavlov/models/ner/README.md and deeppavlov/models/ner/README_NER.md to get started. Create the dataset in the format described in the README and use "conll2003_reader" as the dataset_reader. Then train the network.
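Independent of DeepPavlov's own "conll2003_reader", the CoNLL-style layout above is easy to parse yourself. The following is a minimal sketch (the function name and file handling are illustrative, not part of the library):

```python
def read_conll(path):
    """Parse a CoNLL-style file into (tokens, tags) sentence pairs.

    Sentences are separated by blank lines; each non-blank line holds
    a token and its BIO tag separated by whitespace.
    """
    sentences, tokens, tags = [], [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                # blank line ends the current sentence
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], []
                continue
            token, tag = line.split()[:2]
            tokens.append(token)
            tags.append(tag)
    if tokens:  # file may not end with a blank line
        sentences.append((tokens, tags))
    return sentences
```

Each returned pair is one training sentence, e.g. `(["center"], ["B-LOC"])`, ready to feed into a tagger.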

2) Slot Filling

Slot filling is the task of matching an extracted entity against one of a predefined set of entities. For this purpose we provide a dictionary of possible variations of each entity in the following form:

{
    "location": {
        "center": [
            "downtown",
            "center"
        ],
        "suburbs": [
            "outskirts",
            "suburbs"
        ]
    },
    "food": {
...

This dictionary should be provided in the file slot_vals.json.
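To illustrate the second, fuzzy-search phase: given a chunk extracted by the network, we pick the canonical value whose variation it best resembles. This is a hedged sketch, not DeepPavlov's actual implementation; it uses stdlib `difflib.SequenceMatcher` similarity as a stand-in for raw Levenshtein distance, and the variation lists are made up for the example:

```python
import difflib

# Hypothetical slot_vals.json content for the example above.
SLOT_VALS = {
    "location": {
        "center": ["downtown", "center"],
        "suburbs": ["outskirts", "suburbs"],
    },
    "food": {
        "korean": ["korean"],
        "chinese": ["chinese"],
        "russian traditional": ["russian traditional", "russian"],
    },
}

def fill_slot(slot, chunk, threshold=0.8):
    """Map an extracted chunk to the canonical value whose variation
    is most similar to it; return None if nothing clears `threshold`."""
    best_value, best_score = None, threshold
    for value, variations in SLOT_VALS[slot].items():
        for variation in variations:
            score = difflib.SequenceMatcher(None, chunk.lower(), variation).ratio()
            if score >= best_score:
                best_value, best_score = value, score
    return best_value
```

So a misspelled chunk like "downtwn" still resolves to the canonical location "center", which is the point of the fuzzy phase.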

aCombray commented 6 years ago

Hello. Thanks for the detailed explanation. I have implemented a slot_filler that only uses Levenshtein distance to extract possible entities. I think the data required by the NER part might be difficult to obtain for many tasks, so my simpler slot_filler might also be helpful. Can I make a pull request so you can decide whether you want to include it in DeepPavlov?
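An NER-free slot filler along these lines could look roughly as follows (a sketch under stated assumptions, not the actual pull-request code): scan every token n-gram of the utterance against all known slot variations and keep the best match per slot above a similarity threshold, again with stdlib `difflib` standing in for Levenshtein distance:

```python
import difflib

def extract_slots(utterance, slot_vals, threshold=0.85):
    """Extract slots directly from the utterance, with no NER step.

    `slot_vals` has the slot_vals.json shape:
    {slot: {canonical_value: [variation, ...]}}.
    """
    tokens = utterance.lower().split()
    found = {}  # slot -> (canonical value, best score so far)
    for slot, values in slot_vals.items():
        for value, variations in values.items():
            # only consider n-grams as long as the longest variation
            max_len = max(len(v.split()) for v in variations)
            for n in range(1, max_len + 1):
                for i in range(len(tokens) - n + 1):
                    ngram = " ".join(tokens[i:i + n])
                    for variation in variations:
                        score = difflib.SequenceMatcher(None, ngram, variation).ratio()
                        if score >= threshold and score >= found.get(slot, (None, 0))[1]:
                            found[slot] = (value, score)
    return {slot: value for slot, (value, _) in found.items()}
```

The trade-off versus the NER route is exactly the one described above: no tagged training data is needed, but every n-gram is compared against every variation, so it is slower and more prone to false matches on large dictionaries.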

vaibhavgeek commented 6 years ago

Hi @aCombray, can you please share your solution? I am looking for an example that is not centered around food ordering and at the same time is goal-oriented. It's a little hard to follow the documentation. It would be great if you could share the config file that you have made along with the respective data files.

Regards,

aCombray commented 6 years ago

Yes. I have since left the project and don't remember the details, but the following JSON file is the config. The slots are for the information needed to answer drug questions. The hard part, as I remember, is providing the training data with the right tags. I hope it is helpful.

{
  "dataset_reader": {
    "name": "pharma_reader",
    "data_path": "pharma"
  },
  "dataset_iterator": {
    "name": "dialog_iterator"
  },
  "chainer": {
    "in": [
      "x"
    ],
    "in_y": [
      "y"
    ],
    "out": [
      "y_predicted"
    ],
    "pipe": [
      {
        "id": "token_vocab",
        "fit_on": [
          "x"
        ],
        "name": "default_vocab",
        "level": "token",
        "tokenizer": {
          "name": "split_tokenizer"
        },
        "save_path": "vocabs/token.dict",
        "load_path": "vocabs/token.dict"
      },
      {
        "id": "classes_vocab",
        "name": "default_vocab",
        "level": "token",
        "save_path": "vocabs/classes.dict",
        "load_path": "vocabs/classes.dict"
      },
      {
        "in": [
          "x"
        ],
        "in_y": [
          "y"
        ],
        "out": [
          "y_predicted"
        ],
        "main": true,
        "name": "go_bot_sql",
        "debug": false,
        "word_vocab": "#token_vocab",
        "template_path": "pharma/pharma-templates.txt",
        "req_sub_dict_path": "pharma/req_sub_dict.json",
        "database":"../../working_dev_only/NovartisQA.sqlite",
        "use_action_mask": false,
        "network": {
          "name": "go_bot_rnn",
          "load_path": "go_bot/model_pharma2",
          "save_path": "go_bot/model_pharma2",
          "learning_rate": 0.002,
          "dropout_rate": 0.8,
          "hidden_size": 128,
          "dense_size": 64,
          "obs_size": 2009,
          "action_size": 18
        },
        "slot_filler": {
          "name": "simple_slotfilling",
          "save_path": "slots",
          "file_name": "pharma.json",
          "threshold": 0.9
        },
        "intent_classifier": {
          "name": "intent_model",
          "save_path": "intents/intent_cnn_v3",
          "load_path": "intents/intent_cnn_v3",
          "classes": "#classes_vocab.keys()",
          "opt": {
            "train_now": true,
            "kernel_sizes_cnn": [
              3,
              3,
              3
            ],
            "filters_cnn": 512,
            "lear_metrics": [
              "binary_accuracy",
              "fmeasure"
            ],
            "confident_threshold": 0.5,
            "optimizer": "Adam",
            "lear_rate": 0.1,
            "lear_rate_decay": 0.1,
            "loss": "binary_crossentropy",
            "text_size": 15,
            "coef_reg_cnn": 1e-4,
            "coef_reg_den": 1e-4,
            "dropout_rate": 0.5,
            "epochs": 1,
            "dense_size": 100,
            "model_name": "cnn_model",
            "batch_size": 64,
            "val_every_n_epochs": 5,
            "verbose": true,
            "val_patience": 5
          },
          "embedder": {
            "name": "fasttext",
            "save_path": "embeddings/pharma_fastText_model.bin",
            "load_path": "embeddings/pharma_fastText_model.bin",
            "emb_module": "fasttext",
            "dim": 100
          },
          "tokenizer": {
            "name": "nltk_tokenizer",
            "tokenizer": "wordpunct_tokenize"
          }
        },
        "embedder": {
          "name": "fasttext",
          "save_path": "embeddings/wiki.simple.bin",
          "load_path": "embeddings/wiki.simple.bin",
          "emb_module": "fasttext",
          "dim": 300
        },
        "bow_encoder": {
          "name": "bow"
        },
        "tokenizer": {
          "name": "stream_spacy_tokenizer",
          "lowercase": false
        },
        "tracker": {
          "name": "featurized_tracker",
          "slot_names": [
            "<age_group>",
            "<condition>",
            "<delivery>",
            "<drug_name>",
            "<strength>",
            "<symptom>",
            "<units>"
          ]
        }
      }
    ]
  },
  "train": {
    "epochs": 200,
    "batch_size": 2,

    "metrics": ["per_item_dialog_accuracy"],
    "validation_patience": 20,
    "val_every_n_epochs": 1,

    "log_every_n_batches": -1,
    "log_every_n_epochs": 1,
    "show_examples": true
  }
}