Cartus / AGGCN

Attention Guided Graph Convolutional Networks for Relation Extraction (authors' PyTorch implementation for the ACL19 paper)

Evaluation score lower than reported #7

Closed · wzhouad closed this issue 5 years ago

wzhouad commented 5 years ago

Hi, I retrained the model with 5 different random seeds on TACRED. However, the average F1 score is 67.116 (±0.121), which is much lower than the score reported in your paper. Is the default model config correct? Also, how large is the std in your experiments?
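For reference, a minimal sketch of the aggregation across runs; the per-run F1 values below are placeholders for illustration, not the actual scores:

```python
import numpy as np

# Test-set F1 (%) from five training runs with different seeds
# (placeholder values -- substitute your own per-run scores)
f1_runs = np.array([67.0, 67.1, 67.2, 67.1, 67.2])

# Report the sample mean and standard deviation across seeds
print("mean F1 = %.3f, std = %.3f" % (f1_runs.mean(), f1_runs.std(ddof=1)))
```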

Cartus commented 5 years ago

Hi, may I ask what your running environment is (hardware and software)? The performance of this model varies across environments. If your environment matches what we describe in the Readme.md file, the current default settings should give you exactly the same model as the pretrained one we released here.

Cartus commented 5 years ago

I have also uploaded the logs and the config.

If the running environment is the same as we described here, the output should match logs.txt. The best model is the one we reported and released.

wzhouad commented 5 years ago

Hi, my Python version is 3.6.5, PyTorch is 1.1, and CUDA is 10.0. I'm using a GTX 1080 Ti. I trained another 5 times, and the mean F1 is around 67.5% (±0.3%). I fully understand that software and hardware differences lead to different performance, but I didn't expect such a large difference. Also, could you tell me the mean and std of the F1 score in your experiments? They are important for measuring the stability of the model and for a concrete comparison to other methods.

Cartus commented 5 years ago

Sorry for the late reply.

Yes, I did test the model under settings similar to yours. It seems the loss already differs at the first epoch (1.254588 vs. 1.24539). These minor differences accumulate and eventually lead to a different model (around 67.5%). For now, we couldn't figure out the reason behind this. For the model stats, I will update you later, since I am kind of occupied with visa matters...
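As an aside, here is a sketch of the seeding and cuDNN settings one can pin down in PyTorch 1.x to reduce (though not eliminate) run-to-run variance; results can still differ across GPU models and CUDA/cuDNN versions:

```python
import random
import numpy as np
import torch

def set_seed(seed=1234):
    # Seed every RNG the training loop may touch
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force cuDNN to use deterministic kernels and disable autotuning,
    # trading some speed for reproducibility on a fixed setup
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```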

For the mean and std of the F1 score in my experiments, the stats are 68.2% ± 0.5%. Thank you for pointing out this issue! We deeply appreciate it.

Also, we will update this score in our paper for a fair and concrete comparison to other methods.

marchbnr commented 5 years ago

Hi, I have run the training as well and get similar results to those reported by @wzhouad:

Final Score:
Precision (micro): 70.780%
Recall (micro): 63.308%
F1 (micro): 66.836%
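For anyone comparing numbers, a condensed sketch of how these micro scores are computed in the TACRED-style scorer, assuming the usual convention that no_relation is excluded from the positive class:

```python
NO_RELATION = "no_relation"

def micro_prf(gold, pred):
    # correct: positive predictions that match gold
    # guessed: all positive predictions; gold_pos: all gold positives
    correct = guessed = gold_pos = 0
    for g, p in zip(gold, pred):
        if p != NO_RELATION:
            guessed += 1
            if p == g:
                correct += 1
        if g != NO_RELATION:
            gold_pos += 1
    prec = correct / guessed if guessed else 1.0
    rec = correct / gold_pos if gold_pos else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```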

OS: openSUSE Leap 15.0
GPU: RTX 2080 Ti
CUDA version: 10
Python version: 3.6.8

Package Version
certifi 2019.6.16
cffi 1.12.3
mkl-fft 1.0.12
mkl-random 1.0.2
numpy 1.16.4
pip 19.1.1
pycparser 2.19
setuptools 41.0.1
torch 1.1.0
tqdm 4.32.2
wheel 0.33.4

Cartus commented 5 years ago

Hi @marchbnr ,

As stated in the Readme of this repo, we can't guarantee the performance of this repo when it is run under totally different settings (software and hardware). We have also released the training log and the pre-trained model.

For now, we might not be able to find the cause of this issue, since it involves too many variables (versions of the GPU, CUDA, PyTorch, etc.).