allenai / scifact

Data and models for the SciFact verification task.

Different performance of rationale selection. #4

Closed EdwardZH closed 4 years ago

EdwardZH commented 4 years ago

I have tested the rationale selection part with the full pipeline. The rationale selection performance is calculated with evaluation.py under the rationale_selection folder:

Hit one: 0.42
Hit set: 0.41
F1: 0.434
Precision: 0.3612
Recall: 0.5437

These numbers are worse than the ones reported in your paper. Could you help me with this problem? Thank you very much.
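(For reference, below is a minimal sketch of how rationale-selection metrics like these can be computed. It is an illustration only, not the repository's rationale_selection/evaluation.py; the JSONL field names and the exact definitions of "hit one" / "hit set" are assumptions.)

```python
import json
from collections import Counter

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

def rationale_metrics(gold_path, pred_path):
    """Sketch of hit-one / hit-set and sentence-level P/R/F1.

    Assumed (hypothetical) formats:
      gold:  {"id": claim_id, "evidence": {doc_id: [[sent_idx, ...], ...]}}
      preds: {"claim_id": claim_id, "evidence": {doc_id: [sent_idx, ...]}}
    """
    gold = {g["id"]: g for g in load_jsonl(gold_path)}
    preds = {p["claim_id"]: p for p in load_jsonl(pred_path)}

    hit_one = hit_set = n_pairs = 0
    counts = Counter()

    for claim_id, g in gold.items():
        pred_evidence = preds.get(claim_id, {}).get("evidence", {})
        # Note: predictions for documents outside the gold evidence are
        # ignored in this sketch, which a real evaluation might penalize.
        for doc_id, gold_sets in g["evidence"].items():
            pred_sents = set(pred_evidence.get(doc_id, []))
            gold_sents = {s for rationale in gold_sets for s in rationale}
            counts["tp"] += len(pred_sents & gold_sents)
            counts["predicted"] += len(pred_sents)
            counts["gold"] += len(gold_sents)
            # Hit one: at least one gold rationale sentence was selected.
            hit_one += bool(pred_sents & gold_sents)
            # Hit set: some complete gold rationale set is fully selected.
            hit_set += any(set(r) <= pred_sents for r in gold_sets)
            n_pairs += 1

    precision = counts["tp"] / counts["predicted"] if counts["predicted"] else 0.0
    recall = counts["tp"] / counts["gold"] if counts["gold"] else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "hit_one": hit_one / n_pairs if n_pairs else 0.0,
        "hit_set": hit_set / n_pairs if n_pairs else 0.0,
        "precision": precision, "recall": recall, "f1": f1,
    }
```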

PeterL1n commented 4 years ago

The rationale_selection/evaluation.py script is used for producing the metrics in table 2 of the paper, if that is what you are looking for. However, table 2 measures each module individually, assuming the other parts are oracle. Therefore, you should not run the full pipeline and then use rationale_selection/evaluation.py, because the full pipeline uses TF-IDF for document retrieval, which already decreases performance. You should use oracle document retrieval followed by transformer rationale selection to reproduce the table 2 results.

To reproduce table 2 rationale selection, the correct order to run the scripts is: abstract_retrieval/oracle.py -> rationale_selection/transformer.py -> rationale_selection/evaluate.py

To reproduce table 2 label prediction, the correct order to run the scripts is: abstract_retrieval/oracle.py --include-nei -> rationale_selection/oracle_tfidf.py -> label_prediction/transformers.py -> label_prediction/evaluate.py
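(For convenience, here is a sketch of driving those two table-2 sequences from a single Python script. The script order follows the comment above; all command-line flags and file names are placeholders for illustration and may differ from the repository's actual arguments.)

```python
# Hypothetical driver for the table-2 reproduction sequences described above.
# The script order matches the comment; every flag and file path below is a
# placeholder, not necessarily the repository's actual CLI.
import subprocess

def run(args):
    print(">>", " ".join(args))
    subprocess.run(args, check=True)

# Table 2, rationale selection: oracle retrieval -> transformer -> evaluate.
run(["python", "abstract_retrieval/oracle.py",
     "--dataset", "claims_dev.jsonl", "--output", "abstracts_oracle.jsonl"])
run(["python", "rationale_selection/transformer.py",
     "--abstract-retrieval", "abstracts_oracle.jsonl", "--output", "rationales.jsonl"])
run(["python", "rationale_selection/evaluate.py",
     "--rationale-selection", "rationales.jsonl"])

# Table 2, label prediction: oracle retrieval (--include-nei) -> oracle tfidf
# rationale selection -> transformer label prediction -> evaluate.
run(["python", "abstract_retrieval/oracle.py", "--include-nei",
     "--dataset", "claims_dev.jsonl", "--output", "abstracts_oracle_nei.jsonl"])
run(["python", "rationale_selection/oracle_tfidf.py",
     "--abstract-retrieval", "abstracts_oracle_nei.jsonl", "--output", "rationales_tfidf.jsonl"])
run(["python", "label_prediction/transformers.py",
     "--rationale-selection", "rationales_tfidf.jsonl", "--output", "labels.jsonl"])
run(["python", "label_prediction/evaluate.py",
     "--label-prediction", "labels.jsonl"])
```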

EdwardZH commented 4 years ago

Thank you for your help.