harshalpatilnmu opened this issue 6 years ago
Hey @harshalpatilnmu,
Try these:
1) If you are on Windows, you can use the training batch files inside the dataset directory. For example, datasets/cornell_movie_dialog/train_with_nnlm_en_embeddings.bat. These set the working directory automatically. There are multiple batch files - each one configures training with a different pre-trained embedding configuration. To train your own embeddings, use train_with_random_embeddings.bat
2) If you are running the train.py file yourself, make sure your console working directory is set to the innermost seq2seq-chatbot path. Based on your log above, that would be D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot on your machine.
Also there are pre-trained models you can download and try out, see here: https://github.com/AbrahamSanders/seq2seq-chatbot/tree/master/seq2seq-chatbot/models/cornell_movie_dialog
My dataset is different, so I first need to train the model before I can run it. How can I train my model? I followed your training command but it does not work for me; it shows an error.
Make sure your console working directory is D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot. You should be able to see the hparams.json file directly in this folder. If you are unsure of the working directory, run it from an ipython console and set it manually:
ipython
import os
os.chdir(r"D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot")
%run train.py --datasetdir=datasets\cornell_movie_dialog
I set the directory but it still shows an error:
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
File "train.py", line 14, in
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>
The error message says: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 57 column 33
Check the hparams.json file to make sure no comma is missing on a line that should have one. If you are not sure, copy and paste the file into the left box here: https://jsoneditoronline.org/ and it will automatically detect formatting errors.
I tried this with the committed version in the repository and there are no errors detected.
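If you prefer to check it locally instead, here is a minimal sketch using Python's built-in json module; a parse attempt will point at the exact line and column of the first formatting problem:

import json

# Try to parse hparams.json; any formatting problem raises
# json.decoder.JSONDecodeError reporting the offending line and column.
with open("hparams.json", encoding="utf-8") as f:
    json.load(f)

print("hparams.json parsed successfully")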
(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog
Reading dataset 'cornell_movie_dialog'...
Traceback (most recent call last):
File "train.py", line 31, in
Looks like your dataset is probably not formatted the same way as the cornell movie dialog dataset. You will need to implement a reader for your custom dataset:
See cornell_dataset_reader.py - this class implements the reader that converts the raw cornell files "movie_lines.txt" and "movie_conversations.txt" into the dict id2line and the list conversations_ids.
Duplicate this class, rename it, and tweak the implementation to work with your own dataset format - all that matters is that the output is the same: id2line is a dictionary of dialog lines with unique ids, and conversations_ids is a list of sequences of dialog line ids (each sequence of ids represents a dialog between two people for one or more turns).
Once the new reader is implemented, register an instance of it in the dataset_reader_factory:
readers = [CornellDatasetReader(), YourNewDatasetReader()]
Alternatively if you don't want to do all of this, modify your dataset so that it follows the same format as the cornell movie dialog dataset.
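To make the duplicate-and-tweak step concrete, here is a bare-bones sketch of what a custom reader could look like. The class name, the file name my_dataset.txt, and its assumed format (one dialog line per row, a blank line between conversations) are all made up for the example - only the returned id2line dict and conversations_ids list matter:

from os import path

from dataset_readers.dataset_reader import DatasetReader


class MyDatasetReader(DatasetReader):
    """Example reader for a hypothetical plain-text dialog log."""

    def __init__(self):
        # The name passed to the base class is assumed to match the dataset folder name.
        super(MyDatasetReader, self).__init__("my_dataset")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        # Assumed format: one dialog line per row, blank line between conversations.
        filepath = path.join(dataset_dir, "my_dataset.txt")
        with open(filepath, encoding="utf-8", errors="ignore") as file:
            raw_text = file.read()

        id2line = {}
        conversations_ids = []
        for conv_index, block in enumerate(raw_text.split("\n\n")):
            conv_ids = []
            for line_index, line in enumerate(block.splitlines()):
                if not line.strip():
                    continue
                line_id = "L{}_{}".format(conv_index, line_index)
                id2line[line_id] = line.strip()
                conv_ids.append(line_id)
            if len(conv_ids) > 1:
                conversations_ids.append(conv_ids)

        return id2line, conversations_ids

An instance of it would then be registered in dataset_reader_factory exactly as in the readers line shown above.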
I have a CSV file in which the data is formatted as questions and answers, so how can I read it in dataset_reader_factory? I used the pd.read_csv() function but I got stuck in your code. How do I use id2line and conversations_ids? In my case the data is already prepared; I don't need to split and replace it. Could you help me write the code? Check the following code:
""" Reader class for the Cornell movie dialog dataset """ from os import path
from dataset_readers.dataset_reader import DatasetReader import pandas as pd
class CornellDatasetReader(DatasetReader): """Reader implementation for the Cornell movie dialog dataset """ def init(self): super(CornellDatasetReader, self).init("cornell_movie_dialog")
def _get_dialog_lines_and_conversations(self, dataset_dir):
"""Get dialog lines and conversations. See base class for explanation.
Args:
See base class
"""
# movie_lines_filepath = path.join(dataset_dir, "movie_lines.txt")
# movie_conversations_filepath = path.join(dataset_dir, "movie_conversations.txt")
# Importing the dataset
#with open(movie_lines_filepath, encoding="utf-8", errors="ignore") as file:
# lines = file.read()
#with open(movie_conversations_filepath, encoding="utf-8", errors="ignore") as file:
# conversations = file.read()
# Creating a dictionary that maps each line and its id
#id2line = {}
#for line in lines:
# _line = line.split(" +++$+++ ")
# if len(_line) == 5:
# id2line[_line[0]] = _line[4]
# Creating a list of all of the conversations
#conversations_ids = []
#for conversation in conversations[:-1]:
# _conversation = conversation.split(" +++$+++ ")[-1][1:-1].replace("'", "").replace(" ", "")
# conv_ids = _conversation.split(",")
# conversations_ids.append(conv_ids)
**data = pd.read_csv('abc_data.csv', encoding ='ISO-8859-1', header=None)
## Creating a dictionary that maps each line and its id
id2line=data.to_dict()[1]
#Creating a list of all of the conversations
conversations_ids = data.values.tolist()**
return id2line, conversations_ids
The base class is expecting the data in the format of a conversational log, such as:

Person 1: Hello!
Person 2: How are you?
Person 1: Good, you?
Person 2: Same here.

It infers question-answer pairs as follows:

Question: Hello! --> Answer: How are you?
Question: How are you? --> Answer: Good, you?
Question: Good, you? --> Answer: Same here.
If you already have your data as question-answer pairs, unfortunately you will still need to present it as a conversational log and let the base class turn it back into pairs. Further development could address this and enable a dataset like yours to be used directly - I will open a separate feature-request issue in the repo.
For now, you can take each question-answer pair from your CSV and do this (pseudo code):
for i, qa_pair in enumerate(csv):
    id2line["{}_q".format(i)] = qa_pair["question"]
    id2line["{}_a".format(i)] = qa_pair["answer"]
    conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])
return id2line, conversations_ids
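Filling that in for a pandas DataFrame read from a two-column CSV, the reader could be sketched roughly like this (the column names question and answer, the file name qa_pairs.csv, and the dataset folder name csv_qa are all assumptions for the example):

from os import path

import pandas as pd

from dataset_readers.dataset_reader import DatasetReader


class CsvQADatasetReader(DatasetReader):
    """Hypothetical reader for a CSV of question-answer pairs."""

    def __init__(self):
        super(CsvQADatasetReader, self).__init__("csv_qa")  # assumed dataset folder name

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        # Assumed: one question-answer pair per row, with 'question' and 'answer' columns.
        csv_path = path.join(dataset_dir, "qa_pairs.csv")
        data = pd.read_csv(csv_path, encoding="ISO-8859-1")

        id2line = {}
        conversations_ids = []
        # iterrows() yields one (index, row) pair per CSV row.
        for i, qa_pair in data.iterrows():
            id2line["{}_q".format(i)] = str(qa_pair["question"])
            id2line["{}_a".format(i)] = str(qa_pair["answer"])
            conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])

        return id2line, conversations_ids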
One additional thing - you should set conv_history_length to 0 in hparams.json, both under training_hparams and inference_hparams. If you don't do this, the chatbot will prepend the last N conversation turns to the input as a sort of context, which is probably not what you want if you are trying to make a Q&A bot rather than a conversational bot.
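If it helps, here is one way to make that change programmatically - just a sketch, assuming hparams.json is in the current working directory and that conv_history_length sits directly under the training_hparams and inference_hparams objects:

import json

# Read hparams.json, set conv_history_length to 0 for both training and
# inference, then write the file back.
with open("hparams.json", "r", encoding="utf-8") as f:
    hparams = json.load(f)

hparams["training_hparams"]["conv_history_length"] = 0
hparams["inference_hparams"]["conv_history_length"] = 0

with open("hparams.json", "w", encoding="utf-8") as f:
    json.dump(hparams, f, indent=2)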
Alternately, if you are willing to share your CSV, I can implement the reader and train it on my Titan V GPU.
Hi AbrahamSanders, the data is formatted as questions and answers. I am sharing the CSV file. This is dummy data but the format is the same. Could you help me write the code? Thanks. csv_data.xlsx (this file is in CSV format)
@harshalpatilnmu, pull down csv_dataset_reader.py and dataset_reader_factory.py
Make sure to save your data as a CSV (I don't know if Pandas will accept .xlsx)
Finally, follow the instructions here.
Let me know how it goes!
Some additional notes on hparam configuration (hparams.json):
If you have a basic Q&A dataset, set the hparam inference_hparams/conv_history_length to 0 so that it will treat each question independently while chatting.
Also, you can reduce the size of your model if you have a smaller dataset. The default is pretty big - 4 layer encoder/decoder, 1024 cell units per layer. You can choose to train with the sgd or adam optimizers - the default learning rate is good for sgd, but if you use adam then lower it to 0.001.
Reply: the following is my code:
from os import path

from dataset_readers.dataset_reader import DatasetReader
import pandas as pd


class CornellDatasetReader(DatasetReader):
    def __init__(self):
        super(CornellDatasetReader, self).__init__("cornell_movie_dialog")

    def _get_dialog_lines_and_conversations(self, dataset_dir):
        data = pd.read_csv('full_data.csv', encoding='ISO-8859-1', header=None)
        print(data)
        id2line = {}
        print(id2line)
        conversations_ids = []
        for i, qa_pair in enumerate(data):
            id2line.append("{}_q".format(i), qa_pair["question"])
            id2line.append("{}_a".format(i), qa_pair["answer"])
            conversations_ids.append(["{}_q".format(i), "{}_a".format(i)])
        return id2line, conversations_ids
error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\cornell_movie_dialog
Reading dataset 'cornell_movie_dialog'...
{}
Traceback (most recent call last):
File "train.py", line 31, in
@harshalpatilnmu please follow the directions in my last post. Revert cornell_dataset_reader.py and pull down the new reader as per my post. This should be able to process your CSV - I tested it successfully on the dummy data you sent me.
Also, make sure your data is in the directory \datasets\csv and not \datasets\cornell_movie_dialog, as per the CSV readme.
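For reference, the training command should then look something like the following (assuming the dataset folder is named csv per the CSV readme and no other options are needed):

python train.py --datasetdir=datasets\csv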
Thanks a lot for the support, the model now trains on the dataset properly. I set the hparam inference_hparams/conv_history_length to 0, but it shows repeated answers. When I type a question the first time, it shows the correct answer; the second time, when I pass some other input, the chatbot returns the previous output again. How can I avoid this?
@harshalpatilnmu you're welcome - I'm glad training is working for you now.
Here are a few considerations to help resolve your issue:

1) Size of the dataset - How many training examples are in your dataset? If it is too small, the model will not be able to generalize linguistic rules and is likely to overfit. There is no exact number of examples that would be considered a large enough dataset, but the general rule is the bigger the better. If you have a small dataset you can try training with frozen pre-trained embeddings.

To use pre-trained embeddings, follow these suggestions:

a) If your dataset is mostly common English words: change model_hparams/encoder_embedding_trainable and model_hparams/decoder_embedding_trainable to false, and change training_hparams/input_vocab_import_mode and training_hparams/output_vocab_import_mode to ExternalIntersectDataset.

b) If your dataset is mostly technical, proprietary, or domain-specific words (or words in a language other than English): no additional changes needed to the default hparams.json.

To run it, use the training batch file with nnlm_en embeddings.
2) Unbalanced dataset - If your dataset is unbalanced then you can run into this kind of issue. For example if you have 10,000 questions where 5,000 of them have the same answer "I don't know" and the other 5,000 have unique answers, then your model will likely respond with "I don't know" all the time. A loose way of looking at this would be that for any given question, there is at least a 50% chance that the answer is "I don't know". And as you probably already know the beam-search decoding is taking the sequence with the highest cumulative probability given the encoded input.
3) Underfitting - If you underfit (don't train enough), then the model could spit the same response out again and again due to beam search selecting the cumulatively most likely sequence. In an underfit model, this sequence would be the one that appears the most in your answer set.
4) Model size - if your model is too small then it could cause underfitting. If it is too big it could cause overfitting. The default model size is 4 layers x 1024 units with a bi-directional encoder (2 forward, 2 backward). This is appropriate for the cornell dataset with 300,000 training examples. If you have a smaller dataset, try a smaller model.
5) hparams - If you change the inference hparams (like setting inference_hparams/conv_history_length to 0 in hparams.json), make sure you are:

a) Changing the hparams.json in your model folder, not in the base seq2seq-chatbot folder.

b) Restarting the chat script after changing and saving hparams.json. If you want to change hparams on the fly at runtime for the current session only, use the in-chat commands. For example, you can set conv_history_length to 0 for the current session at runtime by typing --convhistlength=0
6) Beam search - beam search can be tweaked to optimize your output. The default model_hparams/beam_width is 20. Try lowering or raising it. Setting it to 0 disables beam search and uses greedy decoding. This can also be done at runtime with --beamwidth=N.

You can also influence the weights used in beam ranking by changing inference_hparams/beam_length_penalty_weight. The default is 1.25, but you can try raising or lowering it. Higher weights result in longer sequences being preferred, while lower weights result in shorter sequences being preferred. You can do this at runtime with --beamlenpenalty=N
I hope I have given you enough info to optimize your model. Let me know how it goes, I am happy to answer any questions!
File size is 157 KB.
When I try to train the model it shows an error:

(aiml) D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot>python train.py --datasetdir=datasets\chatbot_dataset
Traceback (most recent call last):
  File "train.py", line 14, in
    dataset_dir, model_dir, hparams, resume_checkpoint = general_utils.initialize_session("train")
  File "D:\chatbot\seq2seq-chatbot-master\seq2seq-chatbot\general_utils.py", line 45, in initialize_session
    copyfile("hparams.json", os.path.join(model_dir, "hparams.json"))
  File "C:\Users\1patilha\AppData\Local\Continuum\anaconda3\envs\aiml\lib\shutil.py", line 120, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'hparams.json'