Open LittleWhite0208 opened 1 year ago
Hello, sorry for the late reply. There are no other tricks used to produce the results in the paper; the default settings reflect the exact setup I used in my own experiments. However, during the course of my experiments, I found that the actual hardware used for training and for inference has some effect on the final results. For example, a PARSeq model trained on a V100 always has a slightly lower mean word accuracy (~1% diff) than one trained on an A100. Likewise, the exact same model weights, when used for inference on a V100, yield slightly lower word accuracy (~1% diff) than on an A100.
Why does hardware affect the results? Honestly, I do not know the exact reason, but I have a hunch that it might have something to do with the supported floating-point representations: the V100 only supports standard FP32, while the A100 also supports TF32. Furthermore, the small size of the PARSeq-S model might make it more sensitive to differences in floating-point representations.
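If you want to rule this out on your end, one quick thing to try (a sketch only; I haven't verified that it changes the reported numbers) is explicitly disabling TF32 in PyTorch before training/evaluating:

```python
import torch

# Depending on the PyTorch version, TF32 may be enabled by default for
# matmuls and/or cuDNN convolutions on Ampere GPUs (e.g. A100).
# Disabling both forces full FP32 arithmetic, which should bring A100
# numerics closer to a V100, at some cost in speed.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```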
A good test would be to train bigger PARSeq models (base or larger) on different hardware and see whether the differences are still noticeable.
Thanks for your reply, but our model was also trained on an A100. Are there any other possible reasons, such as different data augmentation?
I see. How many training runs did you use to compute the mean word accuracy? It's possible that SWA (Stochastic Weight Averaging) could cause a bad run (by bad, I mean slightly worse than the expected performance). Also, did you use the MJ+ST data linked here, or did you obtain it from a different source?
I trained for 20 epochs and got the highest accuracy at the 17th epoch. The MJ+ST dataset I used is the one linked here. The configuration file is the same as main.yaml.
Sorry for the late reply. The results presented in the paper (mean+/-std) were based on four random runs for PARSeq and all the reproduced models. Have you tried aggregating results over several runs? Even with the training setup held constant, the ~1% difference could be attributed to differences in CUDA version, PyTorch version, NVIDIA drivers, OS and library versions, and all the other variables that we usually don't account for. So unless you're getting a very big difference (e.g. > 3% or something like that), I'd say your setup is fine and the result you're getting is within expectations.
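For reference, here is a minimal sketch of how aggregate numbers like these can be computed; the per-run accuracies below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

# Hypothetical word accuracies from four independent training runs
# (different random seeds); NOT actual numbers from the paper.
run_accuracies = np.array([91.6, 91.9, 92.1, 91.8])

mean = run_accuracies.mean()
std = run_accuracies.std(ddof=1)  # sample standard deviation
print(f"word accuracy: {mean:.1f} +/- {std:.1f}")
```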
I reproduced the experimental results; however, when using 4 GPUs, accuracy dropped by 5 points!
Hello, I want to reproduce the experimental results in Table 4 of the paper. I downloaded the ST and MJ datasets to train PARSeq, and all hyperparameters are set to the defaults (gpus: 2, batch: 384, max_epoch: 20, and so on). I evaluated the model on the 7,672 test samples, but it can't reach the accuracy in the paper. Under the 36-char setting, the accuracy I got for PARSeq_A is 91.05 and for PARSeq_N is 89.55, which are lower than the 91.9 and 90.7 in Table 4 of the paper. I would like to ask whether other tricks were used in the paper to achieve these results? Thanks
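(For anyone else reading: a minimal sketch of what word accuracy under the 36-char setting means, i.e. case-insensitive matching over digits and letters only. This is not the repository's actual evaluation code, just an illustration.)

```python
import re

def normalize_36char(text: str) -> str:
    # Lowercase and keep only the 36-char set: digits 0-9 and letters a-z.
    return re.sub(r"[^0-9a-z]", "", text.lower())

def word_accuracy(preds, labels) -> float:
    # A prediction counts as correct only if it matches the label exactly
    # after normalization.
    correct = sum(normalize_36char(p) == normalize_36char(l)
                  for p, l in zip(preds, labels))
    return 100.0 * correct / len(labels)

# Dummy usage: "Hello!" matches "hello" after normalization, "W0rld" does not.
print(word_accuracy(["Hello!", "W0rld"], ["hello", "world"]))  # 50.0
```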