allenai / scifact

Data and models for the SciFact verification task.

Different result of `rationale_selection/evaluation.py` #5

Closed neverneverendup closed 4 years ago

neverneverendup commented 4 years ago

The rationale_selection/evaluate.py script is used for producing the metrics in Table 2 of the paper, if that is what you are looking for. However, Table 2 measures only individual modules, assuming the other parts are oracle. Therefore, you should not run the full pipeline and then use rationale_selection/evaluate.py, because the full pipeline uses tfidf for document retrieval, which already decreases performance. You should use oracle document retrieval and then transformer rationale selection to reproduce the results in Table 2.

To reproduce the Table 2 rationale selection result, the correct order to run the scripts is abstract_retrieval/oracle.py -> rationale_selection/transformer.py -> rationale_selection/evaluate.py.

To reproduce the Table 2 label prediction result, the correct order to run the scripts is abstract_retrieval/oracle.py --include-nei -> rationale_selection/oracle_tfidf.py -> label_prediction/transformers.py -> label_prediction/evaluate.py.

Originally posted by @PeterL1n in https://github.com/allenai/scifact/issues/4#issuecomment-632826522
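For concreteness, here is a rough sketch (not from the repo) of chaining those two orderings from Python with subprocess. Each script takes its own command-line arguments (data paths, model, output files), which are omitted here except for the --include-nei flag mentioned above; check each script's argparse help or the repo README for the real options.

```python
import subprocess

def run(script, *args):
    # Invoke one pipeline stage; stop immediately if it fails.
    subprocess.run(["python", script, *args], check=True)

# Table 2, rationale selection:
# oracle abstract retrieval -> transformer rationale selection -> evaluation.
run("abstract_retrieval/oracle.py")
run("rationale_selection/transformer.py")
run("rationale_selection/evaluate.py")

# Table 2, label prediction:
# oracle abstract retrieval (with NEI) -> tfidf oracle rationale selection
# -> transformer label prediction -> evaluation.
run("abstract_retrieval/oracle.py", "--include-nei")
run("rationale_selection/oracle_tfidf.py")
run("label_prediction/transformers.py")
run("label_prediction/evaluate.py")
```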

neverneverendup commented 4 years ago

I want to reproduce the rationale selection F1 result for training on SciFact reported in Table 2, but I ran into a problem.

What's the difference between the highest dev score printed during training and the result produced by rationale_selection/evaluate.py? I had assumed the former was the number reported in Table 2...

But when using the correct evaluation procedure to test the rationale selection model you provided (roberta_large trained on SciFact), I get this result: Hit one: 0.867, Hit set: 0.8457, F1: 0.7402, Precision: 0.7571, Recall: 0.724.

The F1 is over 74, which doesn't match the rationale selection result for training on SciFact in Table 2. What's the problem?

dwadden commented 4 years ago

Sorry for the confusion. I've added a file rationale_selection/evaluate_paper_metrics.py which exactly reproduces the rationale selection numbers from Table 2 of the paper.

The difference is that, as described in the "Sentence-level evaluation" paragraph of Section 4.2 of the paper, a predicted rationale sentence only counts as correctly identified if all other sentences in its gold rationale are also among the predicted rationale sentences. This is enforced in evaluate_paper_metrics.py. In practice, the difference is small. Here's what I get:

{'precision': 0.7371428571428571, 'recall': 0.7049180327868853, 'f1': 0.7206703910614524}
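For illustration, here is a minimal sketch of that sentence-level rule. This is not the actual evaluate_paper_metrics.py code; the data layout (sets of sentence indices per claim/abstract pair) and the precision/recall denominators are just assumptions for the example.

```python
def correctly_identified(predicted, gold_rationales):
    """predicted: set of predicted rationale sentence indices for one claim/abstract pair.
    gold_rationales: list of gold rationales, each a set of sentence indices.
    A predicted sentence only counts if its entire gold rationale was predicted."""
    correct = set()
    for rationale in gold_rationales:
        if rationale <= predicted:       # the whole gold rationale was recovered
            correct |= rationale
    return correct

# Toy example: gold rationale {2, 3} is only partially predicted, so sentence 2
# does not count; gold rationale {5} is fully predicted, so sentence 5 counts.
pred = {2, 5, 7}
gold = [{2, 3}, {5}]
hits = correctly_identified(pred, gold)           # {5}
precision = len(hits) / len(pred)                 # 1/3
recall = len(hits) / sum(len(g) for g in gold)    # 1/3
f1 = 2 * precision * recall / (precision + recall)
print(hits, precision, recall, f1)
```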

Let me know if this doesn't work for you.

neverneverendup commented 4 years ago

Thank you! It works! I get the same performance as you do.

BTW, does label prediction also need a similar metric-enforced evaluation script? I want to test the label prediction model too. Or should I just add a line of code in label_prediction/evaluate.py to calculate the accuracy of the label prediction model's output, like below? I ask because the evaluation metric for label prediction in Table 2 is accuracy.

from sklearn.metrics import accuracy_score
print(f'acc: {accuracy_score(true_labels, pred_labels)}')

Thanks 😄~