Poor performance on the test set of MNER-MI dataset

keaiwangdao commented 4 months ago

Dear author, we are very interested in your work, so we retrained the model following your steps. (By the way, the documentation does not specify a seqeval version; I installed 1.2.2 and wonder whether this affects replication.) With the hyperparameters unchanged, after 15 epochs of training the F1 score is 0.999 on the training set, 0.8088 on the validation set, and 0.7036 on the test set in the final output. Part of the log output is below. If the splits were sampled randomly, the validation and test results should not differ this much. Is there an issue with the test set?

2024-07-13 12:40:22,668 - INFO -   Epoch 15/15, best train f1: 0.999,                            best train epoch: 15, current train f1 score: 0.999
2024-07-13 12:40:22,670 - INFO -   ***** Running evaluate *****
2024-07-13 12:40:22,670 - INFO -     Num instance = 864
2024-07-13 12:40:22,670 - INFO -     Batch size = 8
2024-07-13 12:41:10,956 - INFO -   ***** Dev Eval results *****
2024-07-13 12:41:10,957 - INFO -   
              precision    recall  f1-score   support

         LOC     0.8203    0.8642    0.8417       243
         MIS     0.6364    0.6917    0.6629       253
         ORG     0.6865    0.7257    0.7056       175
         PER     0.8945    0.8929    0.8937       560

   micro avg     0.7937    0.8221    0.8077      1231
   macro avg     0.7594    0.7936    0.7759      1231
weighted avg     0.7972    0.8221    0.8092      1231

2024-07-13 12:41:10,957 - INFO -   Epoch 15/15, best dev f1: 0.8088,                            best dev epoch: 13, current dev f1 score: 0.8077
2024-07-13 12:41:10,959 - INFO -   Get best dev performance at epoch 13, best dev f1 score is 0.8088
2024-07-13 12:41:10,959 - INFO -   The best max_f1 = 0.8088
2024-07-13 12:41:11,852 - INFO -   ***** Running test *****
2024-07-13 12:41:11,853 - INFO -     Num instance = 864
2024-07-13 12:41:11,853 - INFO -     Batch size = 8
2024-07-13 12:42:49,509 - INFO -   ***** Test results *****
2024-07-13 12:42:49,509 - INFO -   --- Test f1 score is 0.7036 ---
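
Regarding the unpinned seqeval version mentioned above, one low-cost step when comparing runs is to record the installed package versions next to each result. The following is a minimal sketch using only the standard library; the package names other than `seqeval` are assumptions about what the environment contains, not something stated in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def report_versions(packages=("seqeval", "torch", "transformers")):
    """Map each package name to its installed version (None if not installed).

    The default package list is an assumption for illustration only.
    """
    return {name: installed_version(name) for name in packages}


# Example: log the mapping once at the start of a training run.
# print(report_versions())
```

Logging this mapping with every run makes it possible to rule library versions in or out when two replications disagree.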

JinFish commented 4 months ago

The dataset is randomly split into training, validation, and test sets. The significant gap between the validation and test sets may arise because the dataset covers Twitter data from 2019 to 2022, so the distribution varies substantially across the dataset, and even a random split can leave the validation and test sets noticeably different. As we note in our paper: "We collect tweets from each month in the years 2019, 2020, 2021, and 2022 to provide a more diverse and unbiased dataset, which also makes it more challenging."
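
One rough way to probe the distributional difference described above is to compare how often each entity type occurs in the two splits. The sketch below assumes the labels are available as BIO tag sequences such as `B-PER`, `I-PER`, `O`; that format is an assumption and is not confirmed in this thread. It reports the total variation distance between the two entity-type distributions, where 0 means identical proportions and 1 means no overlap.

```python
from collections import Counter


def entity_type_counts(tag_sequences):
    """Count entities per type by counting B- tags (assumes BIO tagging)."""
    counts = Counter()
    for tags in tag_sequences:
        for tag in tags:
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    return counts


def total_variation(counts_a, counts_b):
    """Total variation distance between two entity-type distributions."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both splits must contain at least one entity")
    labels = set(counts_a) | set(counts_b)
    # Counter returns 0 for labels that are missing from one split.
    return 0.5 * sum(
        abs(counts_a[label] / total_a - counts_b[label] / total_b)
        for label in labels
    )
```

This only looks at label proportions. It says nothing about vocabulary or image differences between splits, so a small distance does not rule out a distribution shift.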

In our paper we used grid search, with the learning rate ranging from 1e-5 to 7e-5 and the batch size between 8 and 32. Have you run experiments over these hyperparameters and selected the best outcome?
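
The grid search described above can be sketched as follows. Only the ranges (learning rate 1e-5 to 7e-5, batch size 8 to 32) come from this thread; the specific grid points and the `train_and_eval` callable are hypothetical placeholders for whatever training entry point is actually used. The configuration is selected on the validation F1, not on the test F1.

```python
from itertools import product

# Assumed grid points inside the stated ranges; the actual steps are not given.
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 4e-5, 5e-5, 6e-5, 7e-5]
BATCH_SIZES = [8, 16, 32]


def grid_search(train_and_eval, learning_rates=LEARNING_RATES, batch_sizes=BATCH_SIZES):
    """Try every (learning rate, batch size) pair and keep the best dev F1.

    `train_and_eval(lr, batch_size)` is a hypothetical callable that trains
    one model and returns its validation F1 as a float.
    Returns a tuple (best_dev_f1, best_lr, best_batch_size).
    """
    best = None
    for lr, batch_size in product(learning_rates, batch_sizes):
        dev_f1 = train_and_eval(lr, batch_size)
        if best is None or dev_f1 > best[0]:
            best = (dev_f1, lr, batch_size)
    return best
```

With the grid above this means 21 training runs, so the test set should be evaluated once, using the configuration that wins on the validation set.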

An earlier copy of the dataset linked from this repository's README via Baidu Cloud was missing some images. (The cause was related to downloading the dataset through VSCode, which tends to be interrupted when transferring a large number of files, so some images were absent.) If many images in your downloaded copy cannot be opened or have a size of 0 KB, please download the latest dataset following the most recent README.
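
A quick way to check a downloaded copy for the zero-size images described above is to scan the image directory for empty files. This is a minimal sketch: the file extensions are assumptions, and it only detects 0-byte files, not images that are non-empty but corrupted.

```python
from pathlib import Path


def find_empty_images(root, suffixes=(".jpg", ".jpeg", ".png")):
    """Return a sorted list of image files under `root` whose size is 0 bytes.

    The suffix list is an assumption; adjust it to the dataset's actual formats.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() in suffixes
        and path.stat().st_size == 0
    )


# Example (the directory name is hypothetical):
# for path in find_empty_images("data/images"):
#     print(path)
```

If the list is not empty, downloading the dataset again is the safer option, since training on missing images could plausibly affect the reported scores.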

cementbarrier commented 1 month ago

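@JinFish Hello author, thank you very much for the detailed explanation. I am trying to reproduce the results with your hyperparameter grid search. If you still have a record of the best parameters, could you share it? It could save me a lot of time. I only need the learning rate and batch_size pair that gave your best result.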

JinFish commented 1 month ago

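Because of hardware and system-level differences, even the same random seed does not guarantee that model initialization and data sampling match ours. Our best hyperparameters therefore have little reference value. However, the hyperparameter ranges we provide were designed around the optimizer, so searching within them is fine. In short, you still need to run the hyperparameter search yourself.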
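
Even though a fixed seed does not make results match across machines, as noted above, fixing it still helps keep runs on the same machine comparable during a search. The helper below is a minimal sketch, not the repository's own code. It seeds Python's `random` module and, only if they happen to be installed, NumPy and PyTorch.

```python
import random


def seed_everything(seed):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np  # optional dependency
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch  # optional dependency
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call when CUDA is unavailable
    except ImportError:
        pass
```

Repeating each configuration with a few different seeds and comparing the averages gives a more stable picture than a single run.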

cementbarrier commented 1 month ago

OK, thank you very much.

cementbarrier commented 1 month ago

@keaiwangdao Did you tune the hyperparameters for this paper afterwards, and how did it turn out? I would be very grateful if you could reply.
