Two things come to mind
During inference, given that the true output sequence is not observed, we simply feed the predicted output token back in as the input for predicting the next output token. This is a "greedy" inference approach (a rough sketch is below).
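For concreteness, here is a rough, self-contained sketch of that greedy loop. The `next_token_scores` function is a toy stand-in for the trained decoder, not anything from OpenNMT-py:

```python
# Toy sketch of greedy inference; next_token_scores is a hypothetical stand-in
# for the real decoder step, not actual OpenNMT-py code.

def next_token_scores(prefix):
    # Toy "model": emit a couple of words, then prefer the end-of-sequence token.
    if len(prefix) < 3:
        return {"hello": -0.1, "world": -1.0, "</s>": -3.0}
    return {"hello": -2.0, "world": -2.0, "</s>": -0.1}

def greedy_decode(max_len=10):
    prefix = ["<s>"]
    while len(prefix) < max_len and prefix[-1] != "</s>":
        scores = next_token_scores(prefix)
        # Feed the single best predicted token back in as the next input.
        prefix.append(max(scores, key=scores.get))
    return prefix

print(greedy_decode())  # ['<s>', 'hello', 'hello', '</s>']
```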
Also, for completeness, what were the NCM vs. Cakechat results?
@Henry-E
In my analysis, I have tried greedy decoding, beam search with a width of 5, and beam search with a width of 200. Unfortunately, a beam of 5 seemed best for the vanilla Seq2Seq model that I was running. I have also tried increasing the hidden state size to 4096, as per the paper, but I ran into a lot of memory and convergence issues.
In contrast, Cakechat uses a beam of 200.
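For comparison with the greedy loop sketched earlier, here is a minimal beam search over the same kind of toy scorer; `beam_width` corresponds to the 5 vs. 200 settings above. Again, this is only an illustration of the idea, not OpenNMT-py's or Cakechat's actual implementation:

```python
# Toy beam search sketch; next_token_scores is the same hypothetical stand-in
# for the decoder step used in the greedy sketch above.

def next_token_scores(prefix):
    if len(prefix) < 3:
        return {"hello": -0.1, "world": -1.0, "</s>": -3.0}
    return {"hello": -2.0, "world": -2.0, "</s>": -0.1}

def beam_search(beam_width=5, max_len=10):
    beams = [(["<s>"], 0.0)]  # (prefix, cumulative log-score)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == "</s>":          # finished hypotheses carry over unchanged
                candidates.append((prefix, score))
                continue
            for tok, s in next_token_scores(prefix).items():
                candidates.append((prefix + [tok], score + s))
        # Keep only the top-scoring beam_width hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(p[-1] == "</s>" for p, _ in beams):
            break
    return beams[0][0]

print(beam_search(beam_width=5))
```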
My experiments were without tied embeddings.
Comparing NCM with Cakechat: NCM "wins" 55% of the time, whereas Cakechat "wins" 31% of the time.
As a side note, all of our human evaluations (via AMT) are available in an S3 bucket. See the SETC website for the link.
We can try training another model. We did not do anything problem-specific for that; we just trained a basic model for comparison.
@srush If you'd like, I've already got the 2013 OpenSubtitles dataset in OpenNMT format. I've trained dozens of models and have not gotten anywhere close to the reported results, so I would love some help from experts at this point.
Huh, are there features we are missing? If all the hyperparameters are the same, I am not sure why it would be different. Can you post the commands you are running?
To be honest, I don't know. A large part of the reason we started SETC is that I'm having such difficulty with my baselines.
I still haven't submitted the PR for running OpenNMT-py without an attention mechanism. That's on my to-do list for next week.
I am moving the data for OpenSubtitles 2013 into my S3 bucket http://chatbot-eval-data.s3-accelerate.amazonaws.com/data/
It's not something I've spent much time investigating, but I did notice worse performance using OpenNMT-py vs. Nematus around the middle of last year for a shared task. I doubt there is anything wrong with OpenNMT-py; the Nematus hyperparameters I was using were probably better tuned. Still, it could be useful to do some benchmarking against other similar seq2seq frameworks or implementations.
I really doubt we are worse than Nematus, but it is possible we have some bad default hyperparameters. It would be helpful to see exactly what you are running, along with the log files. Happy to help find what is different.
(One possibility may be our bidirectional encoder model; I think we have a different default. Also, other models have a -bridge option, which you need to set explicitly.)
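For anyone following along, here is a rough sketch of how those two options would be added to a legacy OpenNMT-py `train.py` invocation. The flag names come from the discussion above; the data path, model path, and everything else are placeholder assumptions, not the exact commands anyone in this thread ran:

```python
# Hypothetical train.py argument list; only -encoder_type brnn and -bridge are
# the options discussed above, everything else is a placeholder.
train_cmd = [
    "python", "train.py",
    "-data", "data/opensubtitles",       # assumed preprocessed data prefix
    "-save_model", "models/ncm_baseline",
    "-encoder_type", "brnn",             # make the encoder bidirectional explicitly
    "-bridge",                           # bridge encoder/decoder states, as noted above
]
print(" ".join(train_cmd))
```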
Oh wow, I had not paid careful enough attention to the arguments of train.py and missed -bridge.
Thank you. I will rerun and report any improvements that I find.
I agree that it is unlikely to be worse, especially now that OpenNMT-py has so many new features, e.g. the Transformer network.
One thing for sure is that the hyperparameter options are not very well documented or discussed. For example, in the E2E shared task Sebastian was able to get a model training in half the number of epochs, with better performance, by using the adadelta optimizer instead of the default SGD. It's something I could have picked up with a better hyperparameter search, but part of the reason I didn't was wariness about breaking the model by training with conflicting options. Maybe this is more of a Gitter discussion topic.
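To make the adadelta point concrete, here is a minimal illustration of the optimizer swap in plain PyTorch (roughly what OpenNMT-py's -optim flag selects); the linear layer is just a toy stand-in, not the shared-task model:

```python
# Toy illustration: adadelta and SGD are optimizers, chosen via -optim in OpenNMT-py.
import torch

model = torch.nn.Linear(8, 8)               # stand-in for a real seq2seq model
opt_sgd = torch.optim.SGD(model.parameters(), lr=1.0)
opt_adadelta = torch.optim.Adadelta(model.parameters())  # adaptive per-parameter steps
```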
For standard Seq2Seq checkpoints trained with OpenNMT-py, you can go here. In our paper, we found that, according to our human evaluation, our trained Seq2Seq is worse than the responses from the original NCM paper. However, Cakechat's responses were closer to NCM's than our Seq2Seq models'. Cakechat checkpoints are freely available.
With our new tool SETC, I've analyzed your checkpoint on the original test set from the Google NCM paper using Amazon Mechanical Turk. I used all 200 prompts with 3 Turkers for each prompt ...
Compared to NCM: on 56% of the prompts Turkers on average preferred NCM, on 20% they preferred your pre-trained model, and the rest were ties.
Compared with Cakechat: Cakechat "wins" on 43% of prompts and the available pre-trained model "wins" on 47%; the rest were ties. Thus your model is not significantly better than Cakechat according to Turkers on the NCM eval set.
In summary, your pre-trained model is statistically significantly worse than NCM on the NCM evaluation prompts, as rated by human judges.
Does anyone have a pre-trained model that performs close to NCM? If you think so, please fill out our form, and we will test it for you.