Hi @apsdehal,

I wanted to check why there seems to be such high variance on the test set for Text BERT; I reproduce the results here. Could you clarify which test set and val set (seen or unseen?) the results are for? I noticed the paper on arXiv was updated last week, and it's unclear what actually changed in the paper or why the reported numbers changed.
I also ran inference with Text BERT and uploaded the resulting CSV to DrivenData to evaluate the model, and got Acc 0.6020, AUROC 0.6552. This is on the seen test set.
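For reference, this is the kind of check I would like to be able to run locally instead of uploading (a minimal sketch only, assuming the standard MMF prediction CSV with id/proba/label columns and the released dev_seen.jsonl annotations; the file names below are placeholders):

```python
# Minimal sketch: score a predictions CSV against dev_seen.jsonl locally.
# Assumes the prediction CSV has columns id, proba, label and the jsonl
# annotations have fields id, label -- adjust names to match the actual files.
import json

import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

preds = pd.read_csv("hateful_memes_preds.csv")  # placeholder path

with open("dev_seen.jsonl") as f:
    gold = pd.DataFrame([json.loads(line) for line in f])

# Align predictions with ground-truth labels by example id.
merged = gold.merge(preds, on="id", suffixes=("_gold", "_pred"))
acc = accuracy_score(merged["label_gold"], merged["label_pred"])
auroc = roc_auc_score(merged["label_gold"], merged["proba"])
print(f"Acc {acc:.4f}, AUROC {auroc:.4f}")
```

This only works for splits whose labels are released (e.g. dev), which is why I'm asking about the test sets below.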
Also, if the test set has changed from seen to unseen between v2 and v3 of the arXiv paper, why is human accuracy still exactly the same at 84.70? That seems like an odd coincidence.

Lastly, how do I evaluate on the test sets (both seen and unseen) other than by uploading a CSV to DrivenData? It seems Phase 2 evaluations are closed.

Thanks for your time in answering my questions!
The NeurIPS paper on arXiv covers the seen evaluation sets. The competition report, currently under review and coming out soon, will cover the unseen evaluation set.