fudannlp16 / CWS_Dict

Source codes for paper "Neural Networks Incorporating Dictionaries for Chinese Word Segmentation", AAAI 2018

The same-domain experiments produce different test scores than the paper reported #6

Closed tianjianjiang closed 6 years ago

tianjianjiang commented 6 years ago

I've used Python 2.7.12 and tensorflow-gpu 1.0.0 on Ubuntu 16.04 to try to reproduce the same-domain experiments, but so far I have only obtained different (lower) test scores on PKU and MSR for model2. Please advise.

Some more info about my environment:

GabrielLin commented 6 years ago

A small difference is a normal situation. Please show your results for comparison.

tianjianjiang commented 6 years ago

Hi @GabrielLin,

I understand that it usually involves randomness, but IMHO the differences are somewhat too big. For example,

I have tried setting random number seeds for the data shuffle, numpy, and tensorflow. Additionally, I have pinned training to a single GPU. The numbers still fluctuate.
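For reference, the seeding I mean is roughly the following (a minimal sketch; the seed value and device index are illustrative, not the exact ones I used):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # illustrative value; any fixed integer

# Pin the process to a single GPU so runs are not spread across devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Seed Python's shuffle, numpy, and TensorFlow's graph-level RNG (TF 1.x API).
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)
```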

Since PKU seems to converge in only a few epochs, I've tried continuing training after early stopping (see the sketch after the logs below). With 4 rounds, 6+33+43+5 epochs in total, it converges in my environment:

Train Epoch 6 loss 3.573669 426.94 (sec) << Valid Epoch 6 loss 15.409616 P:0.964029 R:0.957178 F:0.960591 Test: P:0.957498 R:0.943251 F:0.950321 Best_F:0.962007 P:0.958340 R:0.951251 F:0.954782

Train Epoch 33 loss 0.737350 434.54 (sec) << Valid Epoch 33 loss 10.526677 P:0.971944 R:0.968199 F:0.970068 Test: P:0.965097 R:0.954269 F:0.959653 Best_F:0.970744 P:0.961356 R:0.954586 F:0.957959

Train Epoch 43 loss 0.503710 308.23 (sec) << Valid Epoch 43 loss 10.408177 P:0.972928 R:0.969850 F:0.971386 Test: P:0.966005 R:0.957805 F:0.961887 Best_F:0.971856 P:0.964821 R:0.959635 F:0.962221

Train Epoch 5 loss 1.404060 303.74 (sec) << Valid Epoch 5 loss 7.886835 P:0.975593 R:0.973874 F:0.974733 Test: P:0.966675 R:0.957460 F:0.962046 Best_F:0.975528 P:0.968025 R:0.960104 F:0.964048

Train Epoch 5 loss 1.404064 300.96 (sec) << Valid Epoch 5 loss 7.886860 P:0.975593 R:0.973874 F:0.974733 Test: P:0.966685 R:0.957479 F:0.962060 Best_F:0.975528 P:0.968025 R:0.960104 F:0.964048
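The resuming itself is nothing special; it amounts to something like the following, assuming the checkpoints written by the training script are restored with tf.train.Saver (the checkpoint directory is illustrative):

```python
import tensorflow as tf

# Minimal sketch of resuming training after early stopping (TF 1.x).
# "checkpoints/pku" is an illustrative path, not necessarily the repo's layout.
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ckpt = tf.train.latest_checkpoint("checkpoints/pku")
    if ckpt is not None:
        saver.restore(sess, ckpt)  # continue from the weights of the previous round
    # ... run the usual epoch loop from here ...
```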

fudannlp16 commented 6 years ago

Set the initial lr=0.01 to train PKU. The pre-trained file for model2 has been updated.

tianjianjiang commented 6 years ago

@fudannlp16 I see. So that is what step 2, "set the hyperparameter of config.py according to the paper", was about; I failed to comprehend that before.
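Concretely, the change is just the learning rate in config.py, something like this (a hypothetical excerpt; the actual attribute name in the repo's config.py may differ):

```python
# Hypothetical excerpt of config.py; the actual attribute name may differ.
class Config(object):
    learning_rate = 0.01  # initial lr for PKU, per the author's advice above
```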

Setting lr=0.01 indeed improved the F1 of PKU to the level reported in the paper.

For MSR, however, I've only matched the paper one time out of ten. The difference between Model-I and Model-II for MSR is relatively small (97.8% - 97.6% = 0.2%), and the ten runs in my environment so far range from 97.74% to 97.82%, so IMHO the range is borderline acceptable. If your experiments showed the same behavior, I will rest my case.

Thank you for all the support.