carpedm20 / lstm-char-cnn-tensorflow

in progress
MIT License
761 stars 243 forks

Validation perplexity is 146.71 at the end of training (24 epochs) #3

Open ygoncharov opened 8 years ago

ygoncharov commented 8 years ago

(it should get ~82 on valid and ~79 on test)

$ python main.py --dataset ptb

.....

epoch: [24] [ 250/ 265] loss: 3.466149
Valid: loss: 5.225354, perplexity: 185.927017
{'perplexity': 83.749542031012467, 'epoch': 24, 'valid_perplexity': 146.71359295576036, 'learningrate': 0.5}
[] Saving checkpoints...
Test: loss: 4.836956, perplexity: 126.084908
[_] Test loss: 4.954320, perplexity: 141.786226

carpedm20 commented 8 years ago

I'm working on this issue, and I don't think the current implementation differs from the original model. I checked the model's validity by comparing the losses of a single batch during the early epochs, and there are no differences. I also checked that the perplexity on the training set goes down to 90.

[Image: loss]

One thing I'm working on is changing the testing algorithm, which differs from the original. The original code calculates the perplexity over all of the test data in a single forward pass, whereas this repo calculates the test perplexity the same way as for the training data, as a batch-averaged perplexity. Changing this should reduce the perplexity somewhat, but I'm not sure it will make the results comparable.

If you find any other differences, feel free to share them with me :smile:
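For readers following the perplexity discussion, here is a minimal sketch of the two ways of aggregating evaluation loss. It reads "batch-averaged perplexity" as the mean of per-batch `exp(loss)` values, which is an assumption about this repo rather than something confirmed in the thread. The function names and the numbers are hypothetical.

```python
import math

def corpus_perplexity(batch_losses, batch_token_counts):
    """exp of the token-weighted mean loss over the whole evaluation set."""
    total_nll = sum(loss * n for loss, n in zip(batch_losses, batch_token_counts))
    return math.exp(total_nll / sum(batch_token_counts))

def batch_averaged_perplexity(batch_losses):
    """Mean of per-batch perplexities, with exp(loss) taken batch by batch."""
    return sum(math.exp(loss) for loss in batch_losses) / len(batch_losses)

# Hypothetical per-batch mean losses (nats per token) and token counts.
losses = [4.6, 4.9, 5.3]
counts = [700, 700, 700]

print("corpus perplexity:        ", corpus_perplexity(losses, counts))
print("batch-averaged perplexity:", batch_averaged_perplexity(losses))
```

With equally sized batches, Jensen's inequality guarantees that the mean of `exp(loss)` is never below `exp` of the mean loss, so under this reading the batch-averaged figure can only overstate the corpus-level perplexity.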

yoonkim commented 8 years ago

Cool stuff! I noticed on the README that you are using 100/150 hidden units for small/large models respectively. I actually use 300/650 hidden units, so this might explain the difference in performance. Also, it seems like you are using RMSProp? I've found vanilla SGD with starting learning rate of 1.0 (halved every time the perplexity does not improve on dev set) to work much better than other optimization methods, including RMSProp.
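The learning-rate rule described above can be sketched as follows. This is a simplified illustration, not the original training code: the function name is hypothetical, the validation perplexities are made up, and "does not improve" is assumed to mean "is not lower than the previous epoch's value".

```python
def halve_on_plateau(valid_ppls, initial_lr=1.0, decay=0.5):
    """Return the learning rate in effect after each epoch's validation check.

    The rate starts at initial_lr and is multiplied by decay whenever the
    validation perplexity fails to drop below the previous epoch's value.
    """
    lr = initial_lr
    schedule = []
    prev = None
    for ppl in valid_ppls:
        if prev is not None and ppl >= prev:
            lr *= decay  # no improvement on the dev set: halve the rate
        schedule.append(lr)
        prev = ppl
    return schedule

# Hypothetical validation perplexities over five epochs.
print(halve_on_plateau([200.0, 150.0, 155.0, 140.0, 141.0]))
```

Comparing against the best perplexity seen so far, rather than the previous epoch, would be an equally plausible reading of the rule and changes only the `prev = ppl` line.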

Hope this helps.

carpedm20 commented 8 years ago

@yoonkim Hi! Thanks for sharing your great work; I enjoyed the paper very much! Actually, the README was out of date because I forgot to update it (I've fixed it now), and the code already uses the same hidden units, optimizer, and decay you mentioned.

yoonkim commented 8 years ago

Ah ok! Few other things may be:

carpedm20 commented 8 years ago

Thanks! I'll dig into those things. What was the perplexity on the training set after training?

yoonkim commented 8 years ago

I think it should be a lot lower. I don't recall the numbers exactly but since the dataset is small and the model has a lot of capacity (even with dropout) training PPL should be well below 50.

nileshkulkarni commented 8 years ago

@carpedm20 Hi, did you find any possible pointers on this issue of high test perplexity? I was trying to debug it, and any help would be appreciated.

yss4 commented 8 years ago

@carpedm20 Hello, thanks for sharing your code on GitHub. I also noticed that the problem of high perplexity on the PTB test set is still open. Have you had a chance to look into this issue, or do you have any pointers for fixing it? Thanks in advance.

carpedm20 commented 8 years ago

@nileshkulkarni @yss4 No, I haven't found the cause of the problem yet, and I'm not working on this project now. But if you find any odd code that differs from the original paper, please share it and I'll take a look.

mkroutikov commented 8 years ago

@carpedm20 This implementation is NOT identical to the original.

Interested readers can have a look at my code here: https://github.com/mkroutikov/tf-lstm-char-cnn, which does reproduce Yoon Kim's result in TF.

hejunqing commented 8 years ago

I ran the code yesterday and got an averaged validation PPL of 156.097 and an averaged test PPL of 149.565. So I am reading your code and the original. The first difference I found was the criterion: yours is CE while the original is NLL. Does it matter?
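On the CE versus NLL question, a small numeric check may help. Mathematically, softmax cross-entropy computed on raw logits equals negative log-likelihood computed on log-softmax outputs. The sketch below shows only that identity; whether both code bases feed their criterion the matching kind of input (logits versus log-probabilities) is an assumption that would need checking in the code. All names and values are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nll(log_probs, target):
    """Negative log-likelihood: expects log-probabilities as input."""
    return -log_probs[target]

def softmax_cross_entropy(logits, target):
    """Cross-entropy against a one-hot target: expects raw logits as input."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[target] / sum(exps))

# Hypothetical logits for a three-symbol vocabulary and a target index.
logits = [2.0, -1.0, 0.5]
target = 2

print(nll(log_softmax(logits), target))
print(softmax_cross_entropy(logits, target))
```

The two printed values agree up to floating-point error, so the choice of criterion alone should not change the loss as long as each one receives the input it expects.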

guanghuixu commented 7 years ago

Thanks for sharing your code. How can I train a word-level model? I found that your code has settings like (use_char = True, use_word = False). Would it help to set 'use_word = True'? Looking forward to your answer, thank you.