apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Question about LSTM implementation: Perplexity convergence differs between lstm_bucketing.py and rnn_cell_demo.py #4774

Closed. YujiOshima closed this issue 7 years ago.

YujiOshima commented 7 years ago

For bugs or installation issues, please provide the following information. The more information you provide, the more likely people will be able to help you.

Environment info

Operating System: Ubuntu 14.04.03 (running on docker. docker host is Ubuntu 16.04)

Compiler: gcc 4.8.4

Package used (Python/R/Scala/Julia): Python

MXNet commit hash (git rev-parse HEAD): b6e8eec8b94c70d9e116b3a4443ce75ce3e07aa2

If you are using python package, please provide

Python version and distribution: Python 2.7.6

Question

I think the following three implementations serve the same purpose in different ways.

But their perplexity convergence results differ. For example:
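
For reference, the Train-Perplexity values in the logs below are the exponential of the average per-token cross-entropy, which is what mx.metric.Perplexity computes. A minimal sketch:

    import math

    def perplexity(total_neg_log_prob, num_tokens):
        # Perplexity = exp of the mean negative log-likelihood per token;
        # lower is better, and 1.0 would mean perfect prediction.
        return math.exp(total_neg_log_prob / num_tokens)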

Summary of dataset ==================
bucket of len  32 : 52318 samples
Summary of dataset ==================
bucket of len  32 : 4131 samples
[Deprecation Warning] mxnet.model.FeedForward has been deprecated. Please use mxnet.mod.Module instead.
[07:36:30] src/operator/tensor/./matrix_op-inl.h:155: Using target_shape will be deprecated.
2017-01-23 07:36:30,132 Start training with [gpu(0)]
[07:36:30] src/operator/tensor/./matrix_op-inl.h:155: Using target_shape will be deprecated.
[07:36:30] src/operator/tensor/./matrix_op-inl.h:155: Using target_shape will be deprecated.
2017-01-23 07:36:36,009 Epoch[0] Batch [50]    Speed: 326.78 samples/sec    Train-Perplexity=819.327175
2017-01-23 07:36:41,052 Epoch[0] Batch [100]    Speed: 317.28 samples/sec    Train-Perplexity=37.890269
2017-01-23 07:36:46,056 Epoch[0] Batch [150]    Speed: 319.73 samples/sec    Train-Perplexity=30.136881
2017-01-23 07:36:51,061 Epoch[0] Batch [200]    Speed: 319.74 samples/sec    Train-Perplexity=27.374816
2017-01-23 07:36:56,057 Epoch[0] Batch [250]    Speed: 320.25 samples/sec    Train-Perplexity=24.731618
2017-01-23 07:37:01,096 Epoch[0] Batch [300]    Speed: 317.56 samples/sec    Train-Perplexity=23.069615
2017-01-23 07:37:06,104 Epoch[0] Batch [350]    Speed: 319.46 samples/sec    Train-Perplexity=25.119809
2017-01-23 07:37:11,156 Epoch[0] Batch [400]    Speed: 316.76 samples/sec    Train-Perplexity=23.873587
2017-01-23 07:37:16,182 Epoch[0] Batch [450]    Speed: 318.30 samples/sec    Train-Perplexity=22.034268
2017-01-23 07:37:21,158 Epoch[0] Batch [500]    Speed: 321.57 samples/sec    Train-Perplexity=21.762741
2017-01-23 07:37:26,070 Epoch[0] Batch [550]    Speed: 325.80 samples/sec    Train-Perplexity=20.518414
2017-01-23 07:37:31,077 Epoch[0] Batch [600]    Speed: 319.56 samples/sec    Train-Perplexity=22.382877
2017-01-23 07:37:36,062 Epoch[0] Batch [650]    Speed: 320.98 samples/sec    Train-Perplexity=20.621223
2017-01-23 07:37:41,014 Epoch[0] Batch [700]    Speed: 323.08 samples/sec    Train-Perplexity=21.058044
.
.
.
2017-01-23 07:44:21,580 Epoch[2] Batch [1300]    Speed: 321.41 samples/sec    Train-Perplexity=17.281973
2017-01-23 07:44:26,553 Epoch[2] Batch [1350]    Speed: 321.74 samples/sec    Train-Perplexity=14.715190
2017-01-23 07:44:31,533 Epoch[2] Batch [1400]    Speed: 321.29 samples/sec    Train-Perplexity=16.221104
2017-01-23 07:44:36,559 Epoch[2] Batch [1450]    Speed: 318.40 samples/sec    Train-Perplexity=15.390250
2017-01-23 07:44:41,632 Epoch[2] Batch [1500]    Speed: 315.39 samples/sec    Train-Perplexity=15.445390
2017-01-23 07:44:46,598 Epoch[2] Batch [1550]    Speed: 322.18 samples/sec    Train-Perplexity=14.912412
2017-01-23 07:44:51,602 Epoch[2] Batch [1600]    Speed: 319.79 samples/sec    Train-Perplexity=15.044475
2017-01-23 07:44:54,991 Epoch[2] Resetting Data Iterator
2017-01-23 07:44:54,991 Epoch[2] Time cost=162.703
2017-01-23 07:45:02,795 Epoch[2] Validation-Perplexity=15.626726

The parameters are the same in all three implementations:

    buckets = [32]
    num_hidden = 200
    num_embed = 200
    num_lstm_layer = 2

    num_epoch = 3
    learning_rate = 0.01
    momentum = 0.0

What is the difference between these three implementations?
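
For concreteness, a minimal sketch of how these hyperparameters could define the stacked LSTM using the mx.rnn cell API; vocab_size and seq_len are illustrative placeholders, not values taken from the scripts:

    import mxnet as mx

    # Placeholders for this sketch only.
    vocab_size, seq_len = 10000, 32
    num_hidden, num_embed, num_lstm_layer = 200, 200, 2

    # Stack two LSTM layers.
    stack = mx.rnn.SequentialRNNCell()
    for i in range(num_lstm_layer):
        stack.add(mx.rnn.LSTMCell(num_hidden=num_hidden, prefix='lstm_l%d_' % i))

    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    embed = mx.sym.Embedding(data=data, input_dim=vocab_size,
                             output_dim=num_embed, name='embed')

    # Unroll over the bucket length; outputs has shape (batch, seq_len, num_hidden).
    outputs, _ = stack.unroll(seq_len, inputs=embed, merge_outputs=True)

    # Project every time step onto the vocabulary and apply softmax loss.
    pred = mx.sym.Reshape(outputs, shape=(-1, num_hidden))
    pred = mx.sym.FullyConnected(data=pred, num_hidden=vocab_size, name='pred')
    label = mx.sym.Reshape(label, shape=(-1,))
    sym = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')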

mz24cn commented 7 years ago

rnn_cell_demo.py names the built-in RNN parameter "LSTM_bias", and it is initialized to all zeros. This significantly slows convergence. I changed the initializer and obtained convergence speed similar to the other implementations.
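
A minimal sketch of this approach (not the exact code from the fix): use mx.init.Mixed to route parameters whose name matches "LSTM_bias" to a non-zero initializer while every other parameter keeps a standard one. Patterns are matched in order:

    import mxnet as mx

    # "LSTM_bias" parameters get a non-zero init; everything else
    # falls through to the catch-all ".*" pattern and uses Xavier.
    init = mx.init.Mixed(
        patterns=['LSTM_bias', '.*'],
        initializers=[mx.init.Uniform(0.1), mx.init.Xavier()],
    )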

YujiOshima commented 7 years ago

Thank you @mz24cn! I do not know how to initialize the bias of sym.RNN. If you have sample code that initializes the bias, could you share it?
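
For reference, a custom initializer like the sketch above can be passed to Module.fit. This is a hedged sketch, not code from the thread; sym, init, train_iter, and val_iter are assumed to be defined as in the earlier snippets:

    # sym, init, train_iter, and val_iter are assumed from above.
    mod = mx.mod.Module(symbol=sym, context=mx.gpu(0),
                        data_names=['data'], label_names=['softmax_label'])
    mod.fit(train_iter, eval_data=val_iter,
            eval_metric=mx.metric.Perplexity(None),
            initializer=init,
            optimizer='sgd',
            optimizer_params={'learning_rate': 0.01, 'momentum': 0.0},
            num_epoch=3)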

mz24cn commented 7 years ago

I will commit my code in the next several days.

YujiOshima commented 7 years ago

Great! That would be a big help. I am looking forward to your commit.

mz24cn commented 7 years ago

I have submitted a PR: https://github.com/dmlc/mxnet/pull/4819

phunterlau commented 7 years ago

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!