desire2020 / CoT

(Beta version!) Experiment code for the paper "CoT: Cooperative Training for Generative Modeling of Discrete Data"
MIT License

tf 1.6.0 not working, and did not get a good NLL oracle result. #4

Open cloudygoose opened 5 years ago

cloudygoose commented 5 years ago

Hi,

First I tried tf 1.6.0, but ran into a complicated tf bug, so I switched to tf 1.13.1.

After running overnight, I got the output below. I don't know whether it's overfitting or some other issue. It would be good if the authors could state in the README at which batch a good oracle NLL should appear.

Thanks!

batch: 87700 nll_oracle: 9.902623
batch: 87700 nll_test 7.6996207
mediator cooptrain iter#87700, balanced_nll 6.823340
mediator cooptrain iter#87710, balanced_nll 6.853920
mediator cooptrain iter#87720, balanced_nll 6.838597
mediator cooptrain iter#87730, balanced_nll 6.765410
mediator cooptrain iter#87740, balanced_nll 6.852599
mediator cooptrain iter#87750, balanced_nll 6.825665
mediator cooptrain iter#87760, balanced_nll 6.850584
mediator cooptrain iter#87770, balanced_nll 6.827829
mediator cooptrain iter#87780, balanced_nll 6.859410
mediator cooptrain iter#87790, balanced_nll 6.784107
batch: 87800 nll_oracle: 9.896609
batch: 87800 nll_test 7.7063065
mediator cooptrain iter#87800, balanced_nll 6.833647
mediator cooptrain iter#87810, balanced_nll 6.837624
mediator cooptrain iter#87820, balanced_nll 6.833254
cooptrain epoch# 563 jsd 6.7449245
mediator cooptrain iter#87830, balanced_nll 6.858107
mediator cooptrain iter#87840, balanced_nll 6.871158
mediator cooptrain iter#87850, balanced_nll 6.824977
mediator cooptrain iter#87860, balanced_nll 6.804533
mediator cooptrain iter#87870, balanced_nll 6.796575

desire2020 commented 5 years ago

Hi, the two issues may both be due to the implementation of tf.CuDNNLSTM. The reported performance is based on my manually implemented LSTM (as implemented in generator.py; the older version had only one file for both modules). I'm not sure what the difference between the two implementations is. I'm still working on it; my current suggestions are:

  1. If training can't start normally, check that your CuDNN version matches your tf version.
  2. Turn off dropout for the mediator, or use a higher dropout keep rate (e.g. >= 0.75). Note that when the reported NLL_oracle and NLL_test are measured, dropout is turned off to stay consistent with the other algorithms, and as training goes on the computed JSD can rise because of overfitting. This may be useful if you want to reproduce the results; a minimal sketch of the eval-time switch follows this list.
  3. Initialize the mediator's weights from a normal distribution with mean 0 and stddev 0.1 (e.g. tf.random_normal_initializer(0.0, 0.1)) instead of the default Xavier initializer of tf.CuDNNLSTM; see the second sketch below.
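
A minimal TF1 sketch of the dropout switch in point 2 (illustrative only, not the repo's actual code; the names keep_prob, hidden, logits, nll, and train_op are hypothetical):

```python
import tensorflow as tf

# Feedable keep rate: defaults to 1.0 (dropout off) unless explicitly fed.
keep_prob = tf.placeholder_with_default(1.0, shape=[], name="keep_prob")

inputs = tf.placeholder(tf.float32, [None, 64], name="inputs")
hidden = tf.layers.dense(inputs, 128, activation=tf.nn.tanh)
hidden = tf.nn.dropout(hidden, keep_prob=keep_prob)  # no-op when keep_prob == 1.0
logits = tf.layers.dense(hidden, 10)
# ... build nll / train_op on top of logits ...

# Training step: dropout on, with the keep rate suggested above.
#   sess.run(train_op, feed_dict={inputs: batch, keep_prob: 0.75})
# Measuring NLL_oracle / NLL_test: dropout off (the 1.0 default).
#   sess.run(nll, feed_dict={inputs: test_batch})
```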
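
And a sketch of point 3, again just one possible way to wire it rather than the repo's code; the exact hook for tf.CuDNNLSTM may differ, so this shows the plain tf.nn.rnn_cell.LSTMCell route:

```python
import tensorflow as tf

# N(mean=0, stddev=0.1) in place of the default Xavier/Glorot initializer.
normal_init = tf.random_normal_initializer(mean=0.0, stddev=0.1)

# Non-CuDNN LSTM cell whose weight matrices use the normal initializer.
cell = tf.nn.rnn_cell.LSTMCell(num_units=512, initializer=normal_init)

# Hypothetical input shape: [batch, time, embedding].
inputs = tf.placeholder(tf.float32, [None, 20, 64])
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
```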