allenai / longformer

Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
Apache License 2.0

Instructions for Hyperpartisan preparation? #15

Open dvirginz opened 4 years ago

dvirginz commented 4 years ago

Great paper, and a really clean, readable repo, thanks! Any plans to release the Hyperpartisan dataset and benchmark utils? It would really help future researchers follow your pipeline for cleaning and evaluating the data.

Thanks!

ibeltagy commented 4 years ago

We can't upload the data ourselves, but you can get access to it here. For the code, we are busy right now with the EMNLP deadline but can upload it afterwards.

dvirginz commented 4 years ago

Definitely! Good luck with EMNLP :) Whenever you have the time :)

JohannesTK commented 4 years ago

+1 on looking forward to Hyperpartisan code & hope EMNLP goes well!

OleNet commented 4 years ago

On page 8 of the paper it says "For Hyperpartisan we split the training data into train/dev/test sets using standard 90/10/10 splits". I think there is a typo in the split ratio.

I think the correct description is that the training data is split into train/test sets using a 90/10 split. Am I right?

OleNet commented 4 years ago

Another question: the paper says 'For Hyperpartisan we, ..., performed each experiment five times with different seeds to control variability associated with the small dataset'. So how is the final F1 score calculated?

Is the final score the mean of the best score from each run, or the mean of the last score from each run?

armancohan commented 4 years ago

@OleNet Re splits: Yes, 90/10/10 was a typo. We meant 80/10/10 (10% for dev and 10% for test). Re final F1: the final F1 score was calculated as the mean of the test F1 scores from each run. For each run we evaluated the checkpoint with the best dev performance.
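
In code, that selection-then-averaging scheme looks roughly like the sketch below (the `final_f1` helper and the `dev_f1`/`test_f1` keys are hypothetical stand-ins for whatever your training loop logs, not names from this repo):

```python
import statistics

def final_f1(runs):
    """runs: one list of checkpoint logs per seed; each checkpoint
    log is a dict with 'dev_f1' and 'test_f1' scores."""
    per_run_test_f1 = []
    for checkpoints in runs:
        # For each seed, pick the checkpoint with the best dev F1 ...
        best = max(checkpoints, key=lambda c: c["dev_f1"])
        # ... and keep that checkpoint's test F1 as the run's score.
        per_run_test_f1.append(best["test_f1"])
    # Final number: mean over seeds; the stdev shows seed-to-seed variability.
    return statistics.mean(per_run_test_f1), statistics.pstdev(per_run_test_f1)

# Toy usage with made-up numbers for two seeds:
runs = [
    [{"dev_f1": 0.90, "test_f1": 0.87}, {"dev_f1": 0.92, "test_f1": 0.88}],
    [{"dev_f1": 0.91, "test_f1": 0.86}, {"dev_f1": 0.89, "test_f1": 0.85}],
]
print(final_f1(runs))  # mean of [0.88, 0.86] = 0.87
```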

OleNet commented 4 years ago

Got it, thanks for your reply!

sjy1203 commented 4 years ago

Hi, I tested RoBERTa on a randomly split Hyperpartisan dataset (80/10/10). The F1 score is 0.734370 (0.047247) over 6 runs, and I found the 1st team on the leaderboard only got 0.809.

Both scores are much lower than the one reported in the Longformer paper (0.874 for RoBERTa), which puzzles me a lot. I guess the problem is due to the small size of the Hyperpartisan dataset (only 645 samples).


Could you kindly provide the final train/dev/test data for this dataset? I think it's a key step to getting consistent results and making a fair comparison with your model.

armancohan commented 4 years ago

As described in the paper, we split the original "training" set of this dataset into 3 parts. The dataset is small, and using different splits could change the results considerably. We also did some preprocessing/cleaning on this data that could affect the results. We've added instructions, a preprocessing script, and the exact splits we used. Please check out this PR: https://github.com/allenai/longformer/pull/112
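
Until you pick up the exact splits from that PR, here is a minimal sketch of a reproducible 80/10/10 split with a fixed seed (the `split_80_10_10` helper is hypothetical; for fair comparisons, use the splits shipped in the PR rather than re-rolling your own):

```python
import random

def split_80_10_10(examples, seed=0):
    """Deterministically shuffle and split examples into 80/10/10
    train/dev/test; the fixed seed makes the split reproducible."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (examples[:n_train],
            examples[n_train:n_train + n_dev],
            examples[n_train + n_dev:])

# Hyperpartisan's by-article set has 645 labeled examples:
train, dev, test = split_80_10_10(range(645))
print(len(train), len(dev), len(test))  # 516 64 65
```

Different seeds produce noticeably different test sets at this dataset size, which is exactly why the published splits matter.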