Closed: neverneverendup closed this issue 4 years ago
I want to reproduce the rationale selection F1 result from training on SciFact reported in Table 2, but I've run into a problem.
What's the difference between the highest dev score printed during training and the result produced by rationale_selection/evaluate.py? I assumed the former was the result in Table 2 before...
But when using the correct evaluation method to test the rationale selection model you provided (trained with roberta_large on SciFact), I get this result:
Hit one: 0.867
Hit set: 0.8457
F1: 0.7402
Precision: 0.7571
Recall: 0.724
The F1 is over 74, which doesn't match the rationale selection result for training on SciFact in Table 2. What's the problem?
Sorry for the confusion. I've added a file rationale_selection/evaluate_paper_metrics.py that exactly reproduces the rationale selection numbers from Table 2 of the paper.
The difference is that, as described in Section 4.2 of the paper in the paragraph on "Sentence-level evaluation", a predicted rationale sentence is only correctly identified if all other sentences in its gold rationale are also among the predicted rationale sentences. This is enforced in evaluate_paper_metrics.py. In practice, the difference is small. Here's what I get:
{'precision': 0.7371428571428571, 'recall': 0.7049180327868853, 'f1': 0.7206703910614524}
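The sentence-level rule can be sketched as follows. This is a minimal illustration, not the code in evaluate_paper_metrics.py; the function and variable names are my own:

```python
# Sketch of the sentence-level evaluation from Section 4.2: a predicted
# rationale sentence only counts as correct if every other sentence in
# its gold rationale was also predicted.

def rationale_precision_recall_f1(predicted, gold_rationales):
    """predicted: set of predicted sentence indices for one abstract.
    gold_rationales: list of sets, one set per gold rationale."""
    correct = set()
    for rationale in gold_rationales:
        if rationale <= predicted:  # entire gold rationale recovered
            correct |= rationale
    gold_sentences = set().union(*gold_rationales) if gold_rationales else set()
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold_sentences) if gold_sentences else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Sentence 3 is predicted, but its gold rationale {2, 3} is incomplete,
# so it does not count as correct.
print(rationale_precision_recall_f1({0, 1, 3}, [{0, 1}, {2, 3}]))
```

Under plain per-sentence scoring, sentence 3 would count as a hit; the stricter rule is what pulls the numbers slightly below the output of the older script.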
Let me know if this doesn't work for you.
BTW, does label prediction also need an evaluation script with the same metric enforced? I also want to test the label prediction model. Or should I just add a line of code in 'label_prediction/evaluate.py' to calculate the accuracy of the label prediction model's output, like below? I ask because the evaluation metric for label prediction in Table 2 is accuracy.
from sklearn.metrics import accuracy_score
print('acc:', accuracy_score(true_labels, pred_labels))
Thanks 😄~
The rationale_selection/evaluate.py script is used for producing the metrics for Table 2 in the paper, if that is what you are looking for. However, Table 2 measures each module individually, assuming the other parts are oracle. Therefore, you should not run the full pipeline and then rationale_selection/evaluate.py, because the full pipeline uses TF-IDF for document retrieval, which already decreases performance. You should use oracle document retrieval followed by transformer rationale selection to reproduce the results in Table 2. To reproduce the Table 2 rationale selection numbers, the correct order to run the scripts is:
abstract_retrieval/oracle.py
-> rationale_selection/transformer.py
-> rationale_selection/evaluate.py
To reproduce the Table 2 label prediction numbers, the correct order to run the scripts is:
abstract_retrieval/oracle.py --include-nei
-> rationale_selection/oracle_tfidf.py
-> label_prediction/transformers.py
-> label_prediction/evaluate.py
Originally posted by @PeterL1n in https://github.com/allenai/scifact/issues/4#issuecomment-632826522
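On the accuracy question asked earlier: the computation is just the match rate between true and predicted labels, equivalent to sklearn's accuracy_score. A minimal sketch in plain Python; the label strings below are illustrative examples, not taken from the repo's output files:

```python
# Sketch of the accuracy metric for label prediction (Table 2):
# the fraction of claims whose predicted label matches the gold label.

def accuracy(true_labels, pred_labels):
    assert len(true_labels) == len(pred_labels), "label lists must align"
    return sum(t == p for t, p in zip(true_labels, pred_labels)) / len(true_labels)

true = ['SUPPORT', 'CONTRADICT', 'NOT_ENOUGH_INFO', 'SUPPORT']
pred = ['SUPPORT', 'SUPPORT', 'NOT_ENOUGH_INFO', 'SUPPORT']
print('acc:', accuracy(true, pred))  # 3 of 4 labels match
```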