OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Reproduce Neural Conversation Model #612

Closed · jsedoc closed this issue 6 years ago

jsedoc commented 6 years ago

For standard Seq2Seq with OpenNMT-py checkpoints you can go here. In our paper, we found that, according to our human evaluation, our trained Seq2Seq model is worse than the responses from the original NCM paper. However, Cakechat's responses were closer to NCM's than those of our Seq2Seq models. Cakechat checkpoints are freely available.

With our new tool SETC, I've analyzed your checkpoint on the original test set from the Google NCM paper using Amazon Mechanical Turk. I used all 200 prompts with 3 Turkers for each prompt ...

Compared to NCM -- on average, Turkers preferred NCM on 56% of the prompts, preferred your pre-trained model on 20%, and the rest were ties.

Compared with Cakechat -- Cakechat "wins" 43% of the prompts and the available pre-trained model "wins" 47%; the rest were ties. Thus your model is not significantly better than Cakechat according to Turkers on the NCM eval set.

In summary, your pre-trained model is statistically significantly worse than NCM on the NCM evaluation prompts, as rated by human judges.

Does anyone have a pre-trained model that performs close to NCM??? If you think so, please fill out our form, and we will test it for you.

Henry-E commented 6 years ago

Two things come to mind

Also, for completeness, what were the NCM vs Cakechat results?

jsedoc commented 6 years ago

@Henry-E

Comparing NCM with Cakechat -- 55% NCM "wins", whereas 31% Cakechat "wins".

As a side note, all of our human evaluations (via AMT) are available on an S3 bucket. See the SETC website for the link.

srush commented 6 years ago

We can try training another model. We did not do anything problem-specific for that; we just trained a basic model for comparison.

jsedoc commented 6 years ago

@srush If you'd like, I've already got the 2013 OpenSubtitles dataset in OpenNMT format. I've trained dozens of models and have not gotten anywhere close to the reported results, so I would love some help from experts at this point.

srush commented 6 years ago

Huh, are there features we are missing? If all the hyperparameters are the same, I am not sure why it would be different. Can you post the commands you are running?

jsedoc commented 6 years ago

To be honest, I don't know. A large part of the reason we started SETC is that I'm having such difficulty with my baselines.

I still haven't submitted the OpenNMT-py PR that supports a model without an attention mechanism. This is on my to-do list for next week.

I am moving the data for OpenSubtitles 2013 into my S3 bucket http://chatbot-eval-data.s3-accelerate.amazonaws.com/data/
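
For anyone following along, "OpenNMT format" here just means line-aligned prompt/response text files fed through the preprocess.py step. A rough sketch, with placeholder file paths rather than the actual bucket layout:

```bash
# Sketch of the legacy OpenNMT-py preprocessing step; all paths are placeholders.
# Each line of the src-* files is a prompt and the matching line of the tgt-*
# files is the response that follows it in the subtitles.
python preprocess.py \
    -train_src data/opensub2013/src-train.txt \
    -train_tgt data/opensub2013/tgt-train.txt \
    -valid_src data/opensub2013/src-valid.txt \
    -valid_tgt data/opensub2013/tgt-valid.txt \
    -save_data data/opensub2013/opensub
```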

Henry-E commented 6 years ago

It's not something I've spent much time investigating, but I did notice worse performance using OpenNMT-py vs Nematus around the middle of last year for a shared task. I doubt there is anything wrong with OpenNMT-py -- the Nematus hyperparameters I was using were probably better tuned -- but it could be useful to do some benchmarking against other similar seq2seq frameworks or implementations.

srush commented 6 years ago

I really doubt we are worse than Nematus, but it is possible we have some bad default hyperparameters. It would be helpful to know exactly what you are running and to see the log files. Happy to help find what is different.

srush commented 6 years ago

(One possibility may be our bidirectional encoder model; I think we have a different default. Also, other models have a -bridge, which you need to set explicitly here.)
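
Concretely, a train.py run that sets both of those explicitly might look something like the sketch below; the layer and embedding sizes are illustrative placeholders, not the NCM paper's settings:

```bash
# Sketch: explicit bidirectional encoder plus the bridge layer between the
# last encoder state and the first decoder state; sizes are placeholders.
python train.py \
    -data data/opensub2013/opensub \
    -save_model models/opensub_brnn_bridge \
    -encoder_type brnn \
    -bridge \
    -layers 2 \
    -rnn_size 512 \
    -word_vec_size 512
```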

jsedoc commented 6 years ago

Oh wow, I had not paid careful enough attention to the arguments of train.py and missed -bridge.

Thank you. I will rerun and report any improvements that I find.

Henry-E commented 6 years ago

I agree that it is unlikely to be worse, especially now that OpenNMT-py has so many new features, e.g. the Transformer network.

One thing for sure is that the hyperparameter options are not very well documented or discussed. For example, in the E2E shared task Sebastian was able to get a model training in half the number of epochs, with better performance, by using the adadelta optimizer instead of the default SGD. It's something I could have picked up with a better hyperparameter search, but part of the reason I didn't was wariness about breaking the model by training with conflicting options. Maybe this is more of a Gitter discussion topic.
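
For reference, in OpenNMT-py the optimizer switch is just a flag on train.py. A minimal sketch, carrying over the placeholder options from the earlier command (the train.py help text suggests a starting learning rate of 1 for adadelta):

```bash
# Same sketch as above but with adadelta instead of the default SGD;
# learning_rate 1 follows the value suggested for adadelta in the train.py help.
python train.py \
    -data data/opensub2013/opensub \
    -save_model models/opensub_adadelta \
    -encoder_type brnn \
    -bridge \
    -optim adadelta \
    -learning_rate 1
```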