AbrahamSanders / seq2seq-chatbot

A sequence2sequence chatbot implementation with TensorFlow.
MIT License

showing error #11

Open harshalpatilnmu opened 5 years ago

harshalpatilnmu commented 5 years ago

When I try to train the model, it shows an error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initialize_session("train")
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", line 45, in initialize_session
    copyfile("hparams.json", os.path.join(model_dir, "hparams.json"))
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'hparams.json'

AbrahamSanders commented 5 years ago

Hey @harshalpatilnmu,

Try these:

1) If you are on windows, you can use the training batch files inside the dataset directory. For example, datasets/cornell_movie_dialog/train_with_nnlm_en_embeddings.bat. These should set the working directory automatically. There are multiple batch files - each one configures training with different pre-trained embeddings in different configurations. To train your own embeddings, use train_with_random_embeddings.bat

2) If you are running the train.py file yourself, make sure your console working directory is set to the innermost seq2seq-chatbot path. Based on your log above, that would be D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot on your machine.
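
If you are not sure what the working directory currently is, here is a quick check from Python (purely illustrative):

    import os

    # Prints the current working directory; it should end in \seq2seq-chatbot
    print(os.getcwd())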

Also there are pre-trained models you can download and try out, see here: https://github.com/AbrahamSanders/seq2seq-chatbot/tree/master/seq2seq-chatbot/models/cornell_movie_dialog

harshalpatilnmu commented 5 years ago

My dataset is different, so first I need to train the model and only then can I run it. How can I train my model? I followed your training command but it does not work for me - it shows an error.

AbrahamSanders commented 5 years ago

Make sure your console working directory is D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot. You should be able to see the hparams.json file directly in this folder. If you are unsure of the working directory, run it from an ipython console and set it manually:

ipython

import os
os.chdir(r"D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot")

train.py --datasetdir=datasets\cornell_movie_dialog

harshalpatilnmu commented 5 years ago

I set the directory but it still shows an error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
  File "train.py", line 14, in <module>
    dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initialize_session("train")
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", line 60, in initialize_session
    hparams = Hparams.load(hparams_filepath)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\hparams.py", line 33, in load
    hparams = jsonpickle.decode(json)
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-packages\jsonpickle\unpickler.py", line 39, in decode
    data = backend.decode(string)
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-packages\jsonpickle\backend.py", line 194, in decode
    raise e
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-packages\jsonpickle\backend.py", line 191, in decode
    return self.backend_decode(name, string)
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\site-packages\jsonpickle\backend.py", line 203, in backend_decode
    return self._decoders[name](string, *optargs, **decoderkwargs)
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\json\decoder.py", line 355, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33 (char 2005)

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>

AbrahamSanders commented 5 years ago

The error message is saying: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33

Check the hparams.json file to make sure no comma is missing on a line that should have one. If you are not sure, copy and paste the file into the left box here: https://jsoneditoronline.org/ and it will automatically detect formatting errors.

I tried this with the committed version in the repository and there are no errors detected.
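
If you would rather check locally instead of pasting into the online editor, here is a minimal sketch (run from the seq2seq-chatbot folder, where hparams.json lives):

    import json

    # json.load raises a JSONDecodeError pointing at the exact line/column of the problem.
    with open("hparams.json", "r") as f:
        json.load(f)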

harshalpatilnmu commented 5 years ago

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog

Reading dataset 'cornell_movie_dialog'...
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    decoder_embeddings_dir = decoder_embeddings_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\dataset_reader.py", line 106, in read_dataset
    question = id2line[conversation[i]]

AbrahamSanders commented 5 years ago

Looks like your dataset is probably not formatted the same way as the cornell movie dialog dataset. You will need to implement a reader for your custom dataset:

See cornell_dataset_reader.py - this class implements the reader that converts the raw cornell files "movie_lines.txt" and "movie_conversations.txt" into the dict id2line and the list conversations_ids.

Duplicate this class, rename it and tweak the implementation to work with your own dataset format - all that matters is that the output is the same - id2line is a dictionary of dialog lines with unique ids, and conversations_ids is a list of sequences of dialog line ids (each sequence of ids represents a dialog between two people for one or more turns).
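
To illustrate the expected shapes (the ids and text here are made up, not taken from the actual cornell files):

    # What a reader should hand back to the base class:
    id2line = {
        "L1": "Hello!",
        "L2": "How are you?",
        "L3": "Good, you?",
    }

    # Each inner list is one conversation: an ordered sequence of dialog line ids.
    conversations_ids = [
        ["L1", "L2", "L3"],
    ]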

Once the new reader is implemented, register an instance of it in the dataset_reader_factory: readers = [CornellDatasetReader(), YourNewDatasetReader()]

Alternatively, if you don't want to do all of this, modify your dataset so that it follows the same format as the cornell movie dialog dataset.

harshalpatilnmu commented 5 years ago

I have a CSV file in which the data is formatted as questions and answers, so how can I read it in dataset_reader_factory? I used the pd.read_csv() function but I got stuck with your code - how do I use id2line and conversations_ids? In my case the data is already prepared; I don't need to split and replace anything. Could you help me write the code? Check the following code:

""" Reader class for the Cornell movie dialog dataset """
from os import path

from dataset_readers.dataset_reader import DatasetReader
import pandas as pd

class CornellDatasetReader(DatasetReader):
    """Reader implementation for the Cornell movie dialog dataset """

    def __init__(self):
        super(CornellDatasetReader, self).__init__("cornell_movie_dialog")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        """Get dialog lines and conversations. See base class for explanation.
        Args:
            See base class
        """
        # movie_lines_filepath = path.join(dataset_dir, "movie_lines.txt")
        # movie_conversations_filepath = path.join(dataset_dir, "movie_conversations.txt")

        # Importing the dataset
        # with open(movie_lines_filepath, encoding="utf-8", errors="ignore") as file:
        #     lines = file.read()

        # with open(movie_conversations_filepath, encoding="utf-8", errors="ignore") as file:
        #     conversations = file.read()

        # Creating a dictionary that maps each line and its id
        # id2line = {}
        # for line in lines:
        #     _line = line.split(" +++$+++ ")
        #     if len(_line) == 5:
        #         id2line[_line[0]] = _line[4]

        # Creating a list of all of the conversations
        # conversations_ids = []
        # for conversation in conversations[:-1]:
        #     _conversation = conversation.split(" +++$+++ ")[-1][1:-1].replace("'", "").replace(" ", "")
        #     conv_ids = _conversation.split(",")
        #     conversations_ids.append(conv_ids)

        data = pd.read_csv('abc_data.csv', encoding='ISO-8859-1', header=None)

        # Creating a dictionary that maps each line and its id
        id2line = data.to_dict()[1]

        # Creating a list of all of the conversations
        conversations_ids = data.values.tolist()

        return id2line, conversations_ids

AbrahamSanders commented 5 years ago

The base class is expecting the data in the format of a conversational log, such as:

    Person 1: Hello!
    Person 2: How are you?
    Person 1: Good, you?
    Person 2: Same here.

It infers question-answer pairs as follows:

    Question: Hello!        --> Answer: How are you?
    Question: How are you?  --> Answer: Good, you?
    Question: Good, you?    --> Answer: Same here.
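
In other words, each turn after the first is treated as the answer to the previous turn. Conceptually (this is an illustration, not the repo's actual code):

    # Consecutive turns become (question, answer) pairs.
    turns = ["Hello!", "How are you?", "Good, you?", "Same here."]
    qa_pairs = list(zip(turns[:-1], turns[1:]))
    # [('Hello!', 'How are you?'), ('How are you?', 'Good, you?'), ('Good, you?', 'Same here.')]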

If you already have your data as question-answer pairs, unfortunately you will still need to present it as a conversational log and let the base class turn it back into pairs. Further development could address this and enable a dataset like yours to be used directly - I will open a separate feature-request issue in the repo.

For now, you can take each question-answer pair from your CSV and do this (pseudo code):

for i, qa_pair in enumerate(csv):
  id2line.append("{}_q".format(i), qa_pair["question"])
  id2line.append("{}_a".format(i), qa_pair["answer"])
  conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])

return id2line, conversations_ids
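
If it helps, here is the same idea written out as runnable Python (a rough sketch only - it assumes the CSV has been read into a pandas DataFrame with columns named "question" and "answer"; the file name qa_pairs.csv is a placeholder). Since id2line is a dict, entries are assigned by key rather than appended:

    import pandas as pd

    # Placeholder file/column names - adjust to match your CSV.
    data = pd.read_csv("qa_pairs.csv", encoding="ISO-8859-1")

    id2line = {}            # line id -> line text
    conversations_ids = []  # each conversation is a list of line ids

    for i, row in data.iterrows():
        q_id = "{}_q".format(i)
        a_id = "{}_a".format(i)
        id2line[q_id] = row["question"]
        id2line[a_id] = row["answer"]
        conversations_ids.append([q_id, a_id])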

One additional thing - you should set conv_history_length to 0 in hparams.json, both under training_hparams and inference_hparams. If you don't do this, the chatbot will prepend the last N conversation turns to the input as a sort of context, which is probably not what you want if you are trying to make a Q&A bot rather than a conversational bot.

Alternatively, if you are willing to share your CSV, I can implement the reader and train it on my Titan V GPU.

harshalpatilnmu commented 5 years ago

Hi AbrahamSanders, the data is formatted as questions and answers. I am sharing the CSV file - this is dummy data but the format is the same. Could you help me write the code? Thanks. csv_data.xlsx (this file is in CSV format)

AbrahamSanders commented 5 years ago

@harshalpatilnmu, pull down csv_dataset_reader.py and dataset_reader_factory.py

Make sure to save your data as a CSV (I don't know if Pandas will accept .xlsx)

Finally, follow the instructions here.

Let me know how it goes!

Some additional notes on hparam configuration (hparams.json):

If you have a basic Q&A dataset, set the hparam inference_hparams/conv_history_length to 0 so that it will treat each question independently while chatting.

Also, you can reduce the size of your model if you have a smaller dataset. The default is pretty big - 4 layer encoder/decoder, 1024 cell units per layer. You can choose to train with the sgd or adam optimizers - the default learning rate is good for sgd, but if you use adam then lower it to 0.001.

harshalpatilnmu commented 5 years ago

Reply: following is my code

""" Reader class for the Cornell movie dialog dataset """
from os import path
from dataset_readers.dataset_reader import DatasetReader
import pandas as pd

class CornellDatasetReader(DatasetReader):

    def __init__(self):
        super(CornellDatasetReader, self).__init__("cornell_movie_dialog")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        data = pd.read_csv('full_data.csv', encoding='ISO-8859-1', header=None)
        print(data)
        id2line = {}
        print(id2line)
        conversations_ids = []
        for i, qa_pair in enumerate(data):
            id2line.append("{}_q".format(i), qa_pair["question"])
            id2line.append("{}_a".format(i), qa_pair["answer"])
            conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])
        return id2line, conversations_ids

Error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog

Reading dataset 'cornell_movie_dialog'...
{}
Traceback (most recent call last):
  File "train.py", line 31, in <module>
    decoder_embeddings_dir = decoder_embeddings_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\dataset_reader.py", line 88, in read_dataset
    id2line, conversations_ids = self._get_dialog_lines_and_conversations(dataset_dir)
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\dataset_readers\cornell_dataset_reader.py", line 62, in _get_dialog_lines_and_conversations
    id2line.append("{}_q".format(i), qa_pair["question"])
AttributeError: 'dict' object has no attribute 'append'

AbrahamSanders commented 5 years ago

@harshalpatilnmu please follow the directions in my last post. Revert cornell_dataset_reader.py and pull down the new reader as per my post. This should be able to process your CSV - I tested it successfully on the dummy data you sent me.

Also, make sure your data is in the directory \datasets\csv and not \datasets\cornell_movie_dialog as per the CSV readme

harshalpatilnmu commented 5 years ago

Thanks a lot for the support - the model now trains on my dataset properly. I set the hparam inference_hparams/conv_history_length to 0, but it gives repeated answers: the first time I type a question it shows the correct answer, but the second time, when I pass some other input, the chatbot returns the previous output. How can I avoid this?

AbrahamSanders commented 5 years ago

@harshalpatilnmu you're welcome - I'm glad training is working for you now.

Here are a few considerations to help resolve your issue:

1) Size of the dataset - How many training examples are in your dataset? If it is too small, the model will not be able to generalize linguistic rules and is likely to overfit. There is no exact number of examples that would be considered a large enough dataset, but the general rule is: the bigger, the better. If you have a small dataset, you can try training with frozen pre-trained embeddings.

To use pre-trained embeddings, follow these suggestions:

a) If your dataset is mostly common English words: change model_hparams/encoder_embedding_trainable and model_hparams/decoder_embedding_trainable to false, and change training_hparams/input_vocab_import_mode and training_hparams/output_vocab_import_mode to ExternalIntersectDataset.

b) If your dataset is mostly technical, proprietary, or domain-specific words (or words in a language other than English): no additional changes to the default hparams.json are needed.

To run it, use the training batch file with nnlm_en embeddings.

2) Unbalanced dataset - If your dataset is unbalanced, you can run into this kind of issue. For example, if you have 10,000 questions where 5,000 of them have the same answer "I don't know" and the other 5,000 have unique answers, then your model will likely respond with "I don't know" all the time. A loose way of looking at this is that for any given question, there is at least a 50% chance that the answer is "I don't know". And as you probably already know, beam-search decoding takes the sequence with the highest cumulative probability given the encoded input. (See the sketch after this list for a quick way to check your answer distribution.)

3) Underfitting - If you underfit (don't train enough), the model can spit out the same response again and again because beam search selects the cumulatively most likely sequence. In an underfit model, this sequence would be the one that appears most often in your answer set.

4) Model size - If your model is too small it can cause underfitting; if it is too big it can cause overfitting. The default model size is 4 layers x 1024 units with a bi-directional encoder (2 forward, 2 backward). This is appropriate for the cornell dataset with its 300,000 training examples. If you have a smaller dataset, try a smaller model.

5) hparams - If you change the inference hparams (like setting inference_hparams/conv_history_length to 0 in hparams.json), make sure you are: a) changing the hparams.json in your model folder, not in the base seq2seq-chatbot folder; and b) restarting the chat script after you change and save hparams.json. If you want to change hparams on the fly for the current session only, use the chat commands. For example, you can set conv_history_length to 0 for the current session at runtime by typing --convhistlength=0.

6) Beam search - Beam search can be tweaked to optimize your output. The default model_hparams/beam_width is 20; try lowering or raising it. Setting it to 0 disables beam search and uses greedy decoding. This can also be done at runtime with --beamwidth=N. You can also influence the weights used in beam ranking by changing inference_hparams/beam_length_penalty_weight. The default is 1.25, but you can try raising or lowering it. Higher weights make longer sequences preferred, while lower weights make shorter sequences preferred. You can set this at runtime with --beamlenpenalty=N.
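
Here is a quick way to eyeball the answer distribution mentioned in point 2 (a rough sketch - the file name qa_pairs.csv and the column name "answer" are placeholders for however your CSV is laid out):

    import pandas as pd

    # Count how many times each answer occurs; a few answers dominating
    # the counts is a sign of an unbalanced dataset.
    data = pd.read_csv("qa_pairs.csv", encoding="ISO-8859-1")
    print(data["answer"].value_counts().head(10))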

I hope I have given you enough info to optimize your model. Let me know how it goes, I am happy to answer any questions!

harshalpatilnmu commented 5 years ago

File size is 157 KB.