fudannlp16 / CWS_Dict

Source codes for paper "Neural Networks Incorporating Dictionaries for Chinese Word Segmentation", AAAI 2018

The same-domain experiments produce different test scores than the paper reported #6

Closed tianjianjiang closed 6 years ago

tianjianjiang commented 6 years ago

I've used Python 2.7.12 and tensorflow-gpu 1.0.0 on Ubuntu 16.04 to try to reproduce the same-domain experiments, but so far I have only obtained different (lower) test scores on PKU and MSR for model2. Please advise.

Some more info about my environment:

GabrielLin commented 6 years ago

A small difference is a normal situation. Please show your results for comparison.

tianjianjiang commented 6 years ago

Hi @GabrielLin,

I understand that it usually involves randomness, but IMHO the differences are somewhat too big. For example,

I have tried setting random number seeds for the data shuffle, numpy, and tensorflow. Additionally, I have pinned training to a single GPU. The numbers still fluctuate.
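For reference, the seeding I mean is roughly the following (a minimal sketch; the seed value and device index are illustrative, not the exact ones I used):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42  # illustrative value; any fixed integer

# Pin the process to a single GPU so runs are not spread across devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Seed Python's shuffle, numpy, and TensorFlow's graph-level RNG (TF 1.x API).
random.seed(SEED)
np.random.seed(SEED)
tf.set_random_seed(SEED)
```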

Since PKU seems to converge in only a few epochs, I've tried continuing training after early stopping (see the sketch after the logs below). With 4 rounds, 6+33+43+5 epochs in total, it converges in my environment:

Train Epoch 6 loss 3.573669 426.94 (sec) << Valid Epoch 6 loss 15.409616 P:0.964029 R:0.957178 F:0.960591 Test: P:0.957498 R:0.943251 F:0.950321 Best_F:0.962007 P:0.958340 R:0.951251 F:0.954782

Train Epoch 33 loss 0.737350 434.54 (sec) << Valid Epoch 33 loss 10.526677 P:0.971944 R:0.968199 F:0.970068 Test: P:0.965097 R:0.954269 F:0.959653 Best_F:0.970744 P:0.961356 R:0.954586 F:0.957959

Train Epoch 43 loss 0.503710 308.23 (sec) << Valid Epoch 43 loss 10.408177 P:0.972928 R:0.969850 F:0.971386 Test: P:0.966005 R:0.957805 F:0.961887 Best_F:0.971856 P:0.964821 R:0.959635 F:0.962221

Train Epoch 5 loss 1.404060 303.74 (sec) << Valid Epoch 5 loss 7.886835 P:0.975593 R:0.973874 F:0.974733 Test: P:0.966675 R:0.957460 F:0.962046 Best_F:0.975528 P:0.968025 R:0.960104 F:0.964048

Train Epoch 5 loss 1.404064 300.96 (sec) << Valid Epoch 5 loss 7.886860 P:0.975593 R:0.973874 F:0.974733 Test: P:0.966685 R:0.957479 F:0.962060 Best_F:0.975528 P:0.968025 R:0.960104 F:0.964048
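The resuming itself is nothing special; it amounts to something like the following, assuming the checkpoints written by the training script are restored with tf.train.Saver (the checkpoint directory is illustrative):

```python
import tensorflow as tf

# Minimal sketch of resuming training after early stopping (TF 1.x).
# "checkpoints/pku" is an illustrative path, not necessarily the repo's layout.
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ckpt = tf.train.latest_checkpoint("checkpoints/pku")
    if ckpt is not None:
        saver.restore(sess, ckpt)  # continue from the weights of the previous round
    # ... run the usual epoch loop from here ...
```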

fudannlp16 commented 6 years ago

Set the initial lr=0.01 to train PKU. The pre-trained file for model2 has been updated.

tianjianjiang commented 6 years ago

@fudannlp16 I see. So that is what step 2, "set the hyperparameter of config.py according to the paper", was about; I failed to comprehend that before.
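Concretely, the change is just the learning rate in config.py, something like this (a hypothetical excerpt; the actual attribute name in the repo's config.py may differ):

```python
# Hypothetical excerpt of config.py; the actual attribute name may differ.
class Config(object):
    learning_rate = 0.01  # initial lr for PKU, per the author's advice above
```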

Setting lr=0.01 indeed improved the F1 of PKU to the level reported in the paper.

For MSR, however, I've only matched the paper one time out of ten. The difference between Model-I and Model-II for MSR is relatively small (97.8% - 97.6% = 0.2%), and the ten runs in my environment so far range from 97.74% to 97.82%, so IMHO the range is borderline acceptable. If your experiments showed the same behavior, I will rest my case.

Thank you for all the support.