SwordYork / DCNMT

Deep Character-Level Neural Machine Translation
GNU General Public License v3.0

increasing the depth of the model #11

Closed: kadir-gunel closed this issue 7 years ago

kadir-gunel commented 7 years ago

Hello @SwordYork ,

Are these parameters used for determining the depth of the model?

I am planning to deepen the model. Which one would you suggest:

  1. adding layers to both the encoder and the decoder, or
  2. increasing the hidden unit size?

I have been training for more than 10 days on a parallel corpus of nearly 1M sentence pairs, but when I evaluate the model on the test set after 1M iterations I get a BLEU score of 0, which seems odd. With a different parallel corpus (nearly 20M sentence pairs) the 1-gram precision in BLEU was around 26.5 and the overall BLEU was near 3.50.

Also I noticed, after 3 days, that the source language has 110 characters while the target language has 220, and I had set both to 120. :disappointed: Could this be the root of the failure? (Interestingly, in terms of morphology the word translations look fine.)
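
As a sanity check before setting those values, counting the distinct characters takes only a few lines (a minimal sketch; the file paths are placeholders, not files from this repository):

# Count distinct characters in the source and target training files,
# so the character vocabulary sizes in the config can be set correctly.
def count_chars(path):
    chars = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            chars.update(line.rstrip("\n"))
    return len(chars)

print("source characters:", count_chars("train.src"))  # placeholder path
print("target characters:", count_chars("train.trg"))  # placeholder path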

Do you think continuing training is a good idea? Or should I start from scratch with the proper character counts?

B.R. Kadir

SwordYork commented 7 years ago

Hi, I think something must be wrong; the BLEU score should not be 0. Could you plot the training curve or the BLEU score on some training examples? Continuing training won't help.

kadir-gunel commented 7 years ago

My bad. I tried to use multiple references during testing; the evaluation ran, but in the end it gave 0.

Anyway, I have now evaluated with a single reference and I am getting reasonable results.
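
For reference, a minimal multi-reference BLEU sketch using NLTK (not this repository's evaluation script); each hypothesis must be paired with a list of tokenized references, and getting that nesting wrong is an easy way to end up with a silent 0:

from nltk.translate.bleu_score import corpus_bleu

# One *list* of references per hypothesis; the tokens below are toy data.
references = [
    [["the", "cat", "sat", "on", "the", "mat"],
     ["there", "is", "a", "cat", "on", "the", "mat"]],
]
hypotheses = [
    ["the", "cat", "sat", "on", "the", "mat"],
]
print(corpus_bleu(references, hypotheses))  # 1.0: hypothesis matches a reference exactly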

But in any case, could you tell me how a new layer could be added to deepen the network?

SwordYork commented 7 years ago

In my experiments, I found that a multi-layer BiRNN encoder is crucial; for example, a 2-layer BiRNN encoder gives about +2 BLEU. Never use a multi-layer decoder, but increasing the number of hidden units in the decoder is helpful.

PS. I found that allow_gc = False consumes a lot of memory. The default option config.scan.allow_gc=True is enough for this code. Thanks!
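
For reference, a minimal sketch of pinning that flag before Theano is imported (a launcher-style snippet, not part of this repository's scripts):

# Set the scan garbage-collection flag before importing theano; with
# allow_gc enabled, intermediate scan buffers are freed, trading a little
# speed for lower memory usage.
import os
os.environ.setdefault("THEANO_FLAGS", "scan.allow_gc=True")

import theano  # must be imported after THEANO_FLAGS is set
print(theano.config.scan.allow_gc)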

Could you please share the model and the test file?

SwordYork commented 7 years ago

You may replace the hierarchical decoder with a naive character-level decoder, or decode at the BPE level.
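
For illustration only, a naive character-level decoder boils down to a greedy loop like the sketch below; `decoder_step`, `output_probs` and `id2char` are hypothetical stand-ins for the trained model, not functions from this repository.

import numpy as np

def greedy_decode(init_state, decoder_step, output_probs, id2char,
                  bos_id, eos_id, max_len=500):
    # Emit one character at a time until EOS or max_len is reached.
    state, prev_id, chars = init_state, bos_id, []
    for _ in range(max_len):
        state = decoder_step(state, prev_id)   # advance the decoder RNN
        probs = output_probs(state)            # distribution over the charset
        prev_id = int(np.argmax(probs))        # greedy: pick the best char
        if prev_id == eos_id:
            break
        chars.append(id2char[prev_id])
    return "".join(chars)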

We are going to rewrite it in TensorFlow, which should speed things up a lot.

kadir-gunel commented 7 years ago

Thank you @SwordYork.

Unfortunately, I cannot share either the test file or the model (they were paid for by my company and I am not allowed to share them). I am really sorry, and I am saying this with all my heart.

People use open-source tools, and when the time comes to give something back to the community, the vast majority give nothing; I can totally understand your situation here.

On the other hand, what I can do is contribute to your new implementation. When do you plan to implement it?

Sincerely Kadir

SwordYork commented 7 years ago

Never mind. I will complete the TensorFlow version within a month, but I will release a character-level language model within a week. You could help me test and improve it. Thanks!

kadir-gunel commented 7 years ago

Oh, great!

Sure, I can.

In the paper you mention the parameter count in the millions, but in the code, a lot of information is printed before training starts, and one line reads 'total number of parameters : 87'. Does this mean 87 million parameters?

SwordYork commented 7 years ago

Thank you.

There are 87 blocks of parameters in total; for example, a weight matrix is one block of parameters. The total number of parameters can be calculated from the size of each block, which is also printed before training.
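
For example, if the printed block shapes are collected into a list, the total follows directly (a minimal sketch; the shapes below are placeholders, not the actual log output):

from functools import reduce
from operator import mul

# Placeholder shapes standing in for the per-block sizes printed in the log.
block_shapes = [(512, 1024), (1024,), (120, 64)]
total = sum(reduce(mul, shape, 1) for shape in block_shapes)
print("total parameters:", total)  # 532992 for these placeholder shapes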

kadir-gunel commented 7 years ago

Hello @SwordYork ,

I am trying to deepen the model using the config file. The parameters inside the config file seem to belong to your previous paper, at least judging by their names.

Could you please clarify which parameters are model-related?

Thank you in advance

SwordYork commented 7 years ago
config['src_dgru_depth'] = 1
config['bidir_encoder_depth'] = 2 # crucial
config['transition_depth'] = 1
config['trg_dgru_depth'] = 1
config['trg_igru_depth'] = 1

All these parameters are related. But a deeper model is much harder to train, so we think the default config performs best (empirically).
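
That said, if you still want to experiment, deepening is just a config edit (example values only, not a recommended setting):

# Example: deepen only the bidirectional encoder, the one depth that
# clearly helps in my experiments, and leave the other depths at 1.
config['bidir_encoder_depth'] = 3   # default is 2
config['src_dgru_depth'] = 1
config['trg_dgru_depth'] = 1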

kadir-gunel commented 7 years ago

I understand. The reason I want to deepen the model is the size of my data (nearly 1M lines). If I leave every parameter as in your config file, the BLEU score is near 4! I get this score after 70 epochs and 20 days of training. When I plot the training error, the model seems to keep learning until nearly 1,000,000 iterations, but the score is really low.

My main problem is: how can I make the model learn this data properly? Could changing the learning rate help? Or should I conclude that the data size is the problem? Have you experimented with the model on data of this size before?

B.R.

SwordYork commented 7 years ago

Hi, I think the data size is the problem. 1M lines is not enough; the model is overfitting. Deeper models need larger datasets. Changing the learning rate will not help, but you may try dropout on the non-recurrent connections. I have not tried this data size before.
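
For illustration, "dropout on the non-recurrent connections" means dropping only the input-to-hidden projection and leaving the recurrent path untouched; a minimal numpy sketch with placeholder names, not code from this repository:

import numpy as np

rng = np.random.RandomState(1234)

def dropout(x, p, train=True):
    # Inverted dropout: zero each unit with probability p and rescale,
    # so nothing needs to change at test time.
    if not train or p == 0.0:
        return x
    mask = (rng.uniform(size=x.shape) >= p).astype(x.dtype) / (1.0 - p)
    return x * mask

def rnn_step(x_t, h_prev, W_in, W_rec, b, p_drop=0.3, train=True):
    # Drop only the non-recurrent (input) connection; the recurrent path
    # h_prev @ W_rec is never dropped, so the memory stays intact.
    x_t = dropout(x_t, p_drop, train)
    return np.tanh(x_t @ W_in + h_prev @ W_rec + b)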

In my opinion, it is better to use a word-level model if the dataset is small, because the word itself is a strong prior and may help with overfitting. Thanks.