facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

seq2seq not able to replicate results #550

Closed hsgodhia closed 6 years ago

hsgodhia commented 6 years ago

Hi

I am trying to get a simple seq2seq model running with decent results on opensubtitles. I ran the command below on an Nvidia GPU with 12GB of RAM for 15 hours, but the results are not what I was expecting; I was hoping for results like the Neural Conversational Model paper (1506.05869).

python3.6 examples/train_model.py -e 13 -m seq2seq -mf godz7 -t opensubtitles -dt train:stream -hist 1 -bs 32 -tr 100 --dict-maxexs=10000000 --gpu 2 --batch-sort false -hs 500 -esz 500 -nl 2 -emb glove -att general -dr 0.3 -lr 0.001 -clip 5                                                                     

I have tried different hidden sizes from [2048, 1024, 512], and similarly for the embedding size, trading off against batch size so that GPU RAM capacity is not exceeded. I also tried the default options that ship with the seq2seq agent, but the results are not good. Any tips on where I may be going wrong?

Sample results look like:

Enter Your Message: hi there buddy
prediction:  hypothalamus
[Seq2Seq]: hypothalamus
Enter Your Message: ok maybe something better?
[Seq2Seq]: dogged
Enter Your Message: why are these 1 word
prediction:  nineteen
[Seq2Seq]: nineteen
Enter Your Message: and why is it not multi
prediction:  llttie
[Seq2Seq]: llttie
Enter Your Message: ok anyways
prediction:  bunting
[Seq2Seq]: bunting
Enter Your Message: i missed it
prediction:  7OO
[Seq2Seq]: 7OO
Enter Your Message: is this going to work
prediction:  interviewee
[Seq2Seq]: interviewee
Enter Your Message: i guess its just not
[Seq2Seq]: interviewee
Enter Your Message: huh is that repeating
prediction:  Simpson
[Seq2Seq]: Simpson
urikz commented 6 years ago

One possible explanation is that the Neural Conversational Model paper uses a different version of the OpenSubtitles corpus (2013 in particular), which is 60-70 times larger than the one currently used in ParlAI. This PR adds the newer version: https://github.com/facebookresearch/ParlAI/pull/562

Also, the learning rate looks pretty small. As far as I remember, they used 1.0.

hsgodhia commented 6 years ago

Actually, I downloaded the 2018 data, which has a vocabulary of about 100k and about 100M dialogs, and trained on that. About the learning rate: I'm not sure lr = 1 makes sense with an Adam optimizer?

jaseweston commented 6 years ago

We are getting decent results on the Twitter dataset with https://github.com/facebookresearch/ParlAI/tree/master/parlai/agents/language_model -- you could also try that? Although @emilydinan is about to push a small change to it that seems to help (adding PERSON1, PERSON2 tags to indicate a change of speaker).
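
Roughly, the speaker-tag idea is something like the following (a hypothetical plain-Python sketch, not the actual ParlAI change): alternating PERSON1/PERSON2 tags are prepended to each turn before the dialogue history is joined into the model input.

```python
# Hypothetical illustration of the speaker-tag idea, not the actual ParlAI change:
# prepend alternating PERSON1 / PERSON2 tags to each turn in the dialogue history.

def tag_speakers(turns):
    """Prefix each dialogue turn with an alternating speaker tag."""
    tags = ['PERSON1', 'PERSON2']
    return ' '.join('{} {}'.format(tags[i % 2], turn) for i, turn in enumerate(turns))

history = ["hi there how are you doing?", "i 'm fine .", "where are you going for dinner?"]
print(tag_speakers(history))
# PERSON1 hi there how are you doing? PERSON2 i 'm fine . PERSON1 where are you going for dinner?
```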

emilydinan commented 6 years ago

@hsgodhia @urikz the small change that @jaseweston was referring to can be found in the PR here: https://github.com/facebookresearch/ParlAI/pull/573/files

emilydinan commented 6 years ago

@hsgodhia @urikz An additional change that may help is limiting the number of tokens in your dictionary. A new flag, --dict-maxtokens, lets you keep the top N tokens from the dictionary after sorting by frequency (see https://github.com/facebookresearch/ParlAI/pull/565 for the PR that added this).
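
Conceptually, the flag does something like the following (a minimal sketch of frequency-based vocabulary truncation, not ParlAI's actual dictionary code):

```python
from collections import Counter

def truncate_vocab(token_counts, max_tokens):
    """Keep only the max_tokens most frequent tokens, as --dict-maxtokens does conceptually."""
    return [tok for tok, _ in Counter(token_counts).most_common(max_tokens)]

# toy counts; rare words like the ones showing up in the samples above get dropped
counts = {'the': 120, 'you': 95, 'is': 80, 'hypothalamus': 1, 'llttie': 1}
print(truncate_vocab(counts, 3))  # ['the', 'you', 'is']
```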

ShaojieJiang commented 6 years ago

@jaseweston You mean decent results with the default parameter settings? Or can you please share the hyperparameters?

jaseweston commented 6 years ago

@emilydinan i think default?

emilydinan commented 6 years ago

@Jackberg I got somewhat decent results using the language_model in ParlAI, not seq2seq. The hyperparameters I used for that are -vtim 360 -esz 200 -hs 500 -nl 2 -lr 10 -bs 20 (this is on the Twitter dataset; I'm currently training some on the new opensubtitles).

ShaojieJiang commented 6 years ago

@emilydinan Thanks so much!

hsgodhia commented 6 years ago

Would I be doing something wrong? To evaluate after training I run python3.6 examples/interactive.py -dt test:stream -t opensubtitles -m language_model -bs 1 -mf godz2 and I basically get:

[creating task(s): parlai.agents.local_human.local_human:LocalHumanAgent]
Enter Your Message: hih
Enter Your Message: hi.
Enter Your Message: hi

Usually I would expect the predictions to come next.

emilydinan commented 6 years ago

@hsgodhia This PR should fix that: https://github.com/facebookresearch/ParlAI/pull/579 . Sorry about that-- what was happening is that in interactive mode, the Language Model defaults to training mode since "eval_labels" are not present, and so predictions are not produced.
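
In other words, the agent decides its mode from which label fields the observation carries, roughly like this (simplified sketch, not the actual agent code):

```python
def infer_mode(observation):
    """Pick a mode from the label fields present in a ParlAI-style observation.

    'labels' arrive during training and 'eval_labels' during evaluation; in
    interactive mode neither is present, so an agent that only switches on
    'eval_labels' can mistakenly stay in training mode and skip predicting.
    """
    if 'labels' in observation:
        return 'train'
    if 'eval_labels' in observation:
        return 'eval'
    return 'interactive'  # no targets at all: just generate a prediction

print(infer_mode({'text': 'hi there buddy'}))  # interactive
```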

hsgodhia commented 6 years ago

Hi

I'm not sure it was fixed. I did a git pull a second ago, and running the command python3.6 examples/interactive.py -dt test:stream -t opensubtitles -m language_model -bs 1 -mf godz2 gives:

hals/ParlAI'}
[ no model yet at: godz2 ]
[ Using CUDA ]
Loading existing model params from godz2
Overriding option [ hiddensize: 200 => 500]
Dictionary: loading dictionary from godz2.dict
[ num words =  100000 ]
[creating task(s): parlai.agents.local_human.local_human:LocalHumanAgent]
Enter Your Message: hi there harshal
Traceback (most recent call last):
  File "examples/interactive.py", line 47, in <module>
    main()
  File "examples/interactive.py", line 38, in main
    world.parley()
  File "/home/harshals/ParlAI/parlai/core/worlds.py", line 291, in parley
    acts[1] = agents[1].act()
  File "/home/harshals/ParlAI/parlai/agents/language_model/language_model.py", line 463, in act
    return self.batch_act([self.observation])[0]
  File "/home/harshals/ParlAI/parlai/agents/language_model/language_model.py", line 441, in batch_act
    output, hidden, loss_dict, predictions = self.predict(data_list[i], self.hidden, targets_list[i], self.is_training, y_lens)
  File "/home/harshals/ParlAI/parlai/agents/language_model/language_model.py", line 365, in predict
    loss = self.get_target_loss(data, self.hidden, targets, y_lens)
  File "/home/harshals/ParlAI/parlai/agents/language_model/language_model.py", line 296, in get_target_loss
    loss += self.eval_criterion(output_flat, targets.select(1,0).view(-1)).data
RuntimeError: dimension specified as 1 but tensor has no dimensions

emilydinan commented 6 years ago

@hsgodhia The problem is with this line: https://github.com/facebookresearch/ParlAI/blob/master/parlai/agents/language_model/language_model.py#L365 since the targets are None in interactive mode. Currently working on a fix -- I'll keep you updated -- but in the meantime, for testing purposes, you can comment out that line and replace it with "loss = 0".
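
For reference, the temporary workaround amounts to guarding the target-loss computation, something like this sketch (hypothetical helper, not the merged fix):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def safe_target_loss(output_flat, targets):
    """Return 0 when there are no targets (interactive mode) instead of indexing into them."""
    if targets is None or targets.nelement() == 0:
        return 0.0
    # mirrors the original call: score predictions against the first target column
    return criterion(output_flat, targets.select(1, 0).view(-1)).item()

# interactive mode: no targets, so no crash
print(safe_target_loss(torch.randn(4, 10), None))  # 0.0
```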

emilydinan commented 6 years ago

@hsgodhia the fix is merged now: https://github.com/facebookresearch/ParlAI/pull/580

hsgodhia commented 6 years ago

Got it. Yes, it's working on my side now. Here are some results after training for 15 hours with the above parameters on the latest version of opensubtitles:

[ num words =  100000 ]
[creating task(s): parlai.agents.local_human.local_human:LocalHumanAgent]
Enter Your Message: hi there how are you doing?
[LanguageModel]: PERSON2 I 'm fine .
Enter Your Message: Is it raining today?
[LanguageModel]: PERSON2 I 'm going to count to three .
Enter Your Message: where are you going for dinner?
[LanguageModel]: PERSON2 We 're in the middle of the night .
Enter Your Message: which city are you from?
[LanguageModel]: PERSON2 I 'm in charge of this situation .
Enter Your Message: is the movie black panther good?
[LanguageModel]: PERSON2 I don 't know .
Enter Your Message: is east coast better or west coast?
[LanguageModel]: PERSON2 I 'm sure I 'll see it later .
Enter Your Message: training you was not fun
[LanguageModel]: PERSON2 I was just trying to get up on the roof and see if I could signal for help .
Enter Your Message: fine
[LanguageModel]: PERSON2 I 'm not gonna hurt you .
Enter Your Message: bye
[LanguageModel]: PERSON2 I 'm not gonna hurt you !
Enter Your Message: hi
[LanguageModel]: PERSON2 I 'm here , John .
Enter Your Message: see you
[LanguageModel]: PERSON2 I 'm going to count to three .
Enter Your Message: whatever
[LanguageModel]: PERSON2 I 'm not gonna hurt you .

Would it be possible to get a set of parameters that replicates results with the seq2seq agent? I believe bidirectionality and attention would give better results, but the seq2seq agent currently collapses to uttering a single response, or otherwise produces poor results.

ShaojieJiang commented 6 years ago

Hi all, here's what I found and some guesses:

  • The language model does perform well on the Twitter task!
  • Yet it performs badly on the OpenSubtitles dataset, constantly responding "I don't know". The reason, I guess, is two-fold: 1) the 2009 dataset is not big enough; 2) the training is case-sensitive. Through experiments, I found that using a lowercase dictionary made the performance way better.
  • However, the seq2seq model converges very badly on the Twitter dataset. I think it uses the same loss function and perplexity metric as the language model (I didn't have time to look deep into this part), right? But the loss of the language model is always below 10, while the seq2seq loss is always ~200.
  • From the experiments I did on OpenSubtitles, I guess the problem of seq2seq is related to the learning rate implementation? The LM by default uses LR=20, while seq2seq defaults to LR=0.005, and bigger values (roughly higher than 0.5) make the training diverge.

Hope you guys can locate the problem! Cheers!

jaseweston commented 6 years ago

Regarding the model constantly responding "I don't know" on OpenSubtitles: this is a classic, known problem in these kinds of models, mentioned in several papers, see e.g. https://arxiv.org/pdf/1510.03055v2.pdf

ShaojieJiang commented 6 years ago

@jaseweston Sorry, which problem do you mean, the generic response "I don't know"? By saying "constantly", I mean it will respond with this to almost every user utterance. And this problem can be solved by using a lowercase dictionary.

So here I'm suggesting to use lowercase words by default.

jaseweston commented 6 years ago

Yes, the "I don't know" problem -- check that paper link. It cannot be completely solved easily, it seems, although it can indeed be mitigated to some degree with tricks.
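
One of the tricks from that paper is MMI-antiLM reranking of beam candidates, i.e. scoring each candidate by log p(T|S) - lambda * log p(T) so that generic replies the language model already loves get penalized. A toy sketch with made-up log-probabilities:

```python
def mmi_antilm_rerank(candidates, lam=0.5):
    """Pick the candidate maximizing log p(T|S) - lam * log p(T) (MMI-antiLM, arXiv:1510.03055).

    candidates: list of (response, seq2seq_logprob, lm_logprob) tuples;
    the numbers below are made up purely for illustration.
    """
    return max(candidates, key=lambda c: c[1] - lam * c[2])

beam = [
    ("i don 't know .", -2.0, -1.5),                     # likely under both models
    ("i 'm in charge of this situation .", -3.0, -6.0),  # less likely under the LM
]
print(mmi_antilm_rerank(beam)[0])  # the less generic candidate wins
```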

ShaojieJiang commented 6 years ago

@jaseweston Sorry for the confusion. I wasn't claiming to solve this problem. I just meant that a case-sensitive dataset makes the model perform really badly by always giving the same response.

emilydinan commented 6 years ago

Hi @Jackberg -- thanks for your notes! It's great that you are also seeing the language model perform well on the Twitter task. Some comments about the opensubtitles training...

Hope this helps. @alexholdenmiller might have some thoughts on which hyperparameters to use for seq2seq training...

ShaojieJiang commented 6 years ago

@emilydinan @alexholdenmiller Great! Looking forward to your recipe for seq2seq model!

hsgodhia commented 6 years ago

seq2seq currently does gradient clipping (https://github.com/facebookresearch/ParlAI/blob/c3827177770883f12b68c4cb38b3a9611a84323a/parlai/agents/seq2seq/seq2seq.py#L245) and provides torch.optim optimizers (default Adam). I also think that the language model may benefit from having an adaptive learning rate.
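
As a reminder of what that pattern looks like in plain PyTorch (a generic sketch, not the ParlAI code; the clip value and learning rate echo the -clip 5 -lr 0.001 flags in my original command):

```python
import torch
import torch.nn as nn

# toy model standing in for the seq2seq parameters
model = nn.Linear(500, 500)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

def training_step(loss):
    """One update with gradient norm clipping before the optimizer step."""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5)
    optimizer.step()

training_step(model(torch.randn(32, 500)).pow(2).mean())
```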

emilydinan commented 6 years ago

@hsgodhia -- you're right, that was added recently. I missed that. I edited my comment.

ShaojieJiang commented 6 years ago

Hi @hsgodhia, I've figured out the cause of the bad performance in your seq2seq tests: the attention mechanism is doing damage here. Here's what I got within 1 epoch using -att none:

TEXT:  yeah, to get there, we need to make one really good syntax/formatter. opens up arbitrary.
PREDICTION:  i love you so much for your own .
TEXT:  nu-skin? 😂 gop congressman jason chaffetz is being financed by an illegal chinese pyramid scheme via
PREDICTION:  i love you so much !

Looks much better, huh? So I believe there's something wrong in the attention model, @emilydinan. Currently it can only give 1 or 2 responses; I'll train it for more epochs and see what it'll learn.

alexholdenmiller commented 6 years ago

@Jackberg I'm having trouble getting it to produce high-quality text on standard opensubtitles; we're still working on a few other changes to try to improve it.

alexholdenmiller commented 6 years ago

I'm seeing the same thing: really low-quality generation with attention.

alexholdenmiller commented 6 years ago

Bug in attention fixed thanks to @Jackberg; attention is looking much better. Also added post-attention (attention is calculated using the output representation of the RNN instead of the input token representation), and that's looking better as well.
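
Concretely, post-attention here means the attention scores are computed from the decoder RNN's output state rather than the input token representation; roughly like this generic Luong-style sketch (assumed shapes, not the exact ParlAI module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostAttention(nn.Module):
    """Attention over encoder outputs, scored from the decoder's *output* state."""

    def __init__(self, hidden_size):
        super().__init__()
        self.attn = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, dec_output, enc_outputs):
        # dec_output: (batch, hidden); enc_outputs: (batch, src_len, hidden)
        scores = torch.bmm(enc_outputs, self.attn(dec_output).unsqueeze(2)).squeeze(2)
        weights = F.softmax(scores, dim=1)                       # (batch, src_len)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs)   # (batch, 1, hidden)
        return context.squeeze(1), weights

ctx, w = PostAttention(8)(torch.randn(2, 8), torch.randn(2, 5, 8))
print(ctx.shape, w.shape)  # torch.Size([2, 8]) torch.Size([2, 5])
```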

alexholdenmiller commented 6 years ago

Adding IBM's seq2seq model if that's helpful for comparison: #601

alexholdenmiller commented 6 years ago

Hi all, I'm going to close this with a summary of running a single new train job on opensubtitles. I think the biggest change compared to the past is a better tokenizer.

I trained the seq2seq model again on opensubtitles 2009 with the following command:

python examples/train_model.py -gpu 1 -m seq2seq -t opensubtitles:v2009 --dict-lower true -tok re --dict-maxtokens 100000 -hs 2048 -esz 300 -emb glove -opt sgd -lr 3 -dr 0.1 -att none -bs 32 -tr 120 --dict-include-valid false -nl 2 -clip 0.1 -lt enc_dec -histsz 7 -pt true -mom 0.9 -vp 16 -veps 0.25 -mf /tmp/os_s2s_2 -ltim 10 -vmt ppl -vmm min

So, this model used a hidden size of 2048 (half that of "A Neural Conversational Model", Vinyals 2015), an embedding size of 300 (I couldn't tell -- maybe they used 2048?), a 2-layer LSTM (same as Vinyals), a vocab size of 100k (same as Vinyals), and no attention (same as Vinyals).

They said they converged to 17 ppl on the validation set. We calculated a perplexity of 24.95 on the validation set and 21.36 on the test set, including end tokens but only calculated on the "right-hand side" -- that is, only over the outputs of the model given an input (producing "my name is bill" in response to the input "human what is your name ?"), not over the entire concatenated sequence ("human what is your name ? machine my name is bill" or something similar).
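
For clarity, the perplexity above is just the exponentiated average negative log-likelihood over those response tokens (end token included); a minimal sketch with made-up per-token log-probabilities:

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood over the response tokens only."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# hypothetical per-token log-probs for "my name is bill </s>"
print(round(perplexity([-2.1, -3.0, -1.2, -4.5, -0.3]), 2))  # 9.21
```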

This model ran validation every 0.25 epochs and reached its best validation performance after 8.25 epochs. I did not sweep over parameters but just ran the above training; I expect this could be improved further.