Closed yangky11 closed 2 years ago
Hi,
I found a potential bug that may have led to the issue. Some proofs are not strictly trees. For example, the proof below contains both sent11 & sent9 -> int3 and sent11 & sent9 -> int6.
sent17 & sent7 -> int1: a burner of a stove generates heat for cooking usually; int1 & sent13 -> int2: a burner of a stove will be hot in temperature; sent11 & sent9 -> int3: a pan and a burner are kinds of objects; int2 & int3 & sent21 & sent25 -> int4: a frying pan on the burner of the stove is an example of cooler object touching hot object; int4 & sent4 -> int5: thermal conduction will occur between the frying pan and the burner; sent11 & sent9 -> int6: a pan and a burner are kinds of objects; int5 & int6 & sent23 -> hypothesis
The align_conclusions_across_proofs function does not work well for such proofs. Given the proof above as both the prediction and the ground truth, it produces a wrong mapping between intermediate sentences: {'int1': 'int1', 'int2': 'int2', 'int3': 'int3', 'int4': 'int4', 'int5': 'int5', 'int6': 'int3', 'hypothesis': 'hypothesis'} (see int3 and int6).
Hmm, I just realized this is not necessarily a bug in the implementation, but a difficulty in how the alignment works conceptually. If we only aggregate the ancestors and calculate the Jaccard similarity (Appendix B), we cannot disambiguate int3 and int6 in the above case.
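To make the ambiguity concrete, here is a minimal sketch of a Jaccard-based alignment. It is not the repository's align_conclusions_across_proofs code: the helper names (parse_proof, leaves, jaccard, align) are hypothetical, each intermediate conclusion is represented by the set of input sentences it is derived from, and ties are broken by taking the first best match, which is an assumption. The proof is the one quoted above with the sentence text omitted.

```python
# Proof from the example above, with the sentence text after each conclusion omitted.
PROOF = (
    "sent17 & sent7 -> int1; int1 & sent13 -> int2; sent11 & sent9 -> int3; "
    "int2 & int3 & sent21 & sent25 -> int4; int4 & sent4 -> int5; "
    "sent11 & sent9 -> int6; int5 & int6 & sent23 -> hypothesis"
)


def parse_proof(proof):
    """Map each conclusion name to the list of its premises."""
    steps = {}
    for step in proof.split(";"):
        step = step.strip()
        if not step:
            continue
        premises, conclusion = step.split(" -> ")
        # Drop any "name: sentence text" suffix, keeping only the name.
        name = conclusion.split(":")[0].strip()
        steps[name] = [p.strip() for p in premises.split("&")]
    return steps


def leaves(node, steps):
    """Collect the input sentences (sentN) that a node is derived from."""
    if node not in steps:
        return {node}
    result = set()
    for premise in steps[node]:
        result |= leaves(premise, steps)
    return result


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align(pred_steps, gold_steps):
    """Map each predicted conclusion to the gold conclusion with the highest
    Jaccard similarity. max() keeps the first best match on ties (assumption)."""
    mapping = {}
    for p in pred_steps:
        p_leaves = leaves(p, pred_steps)
        mapping[p] = max(
            gold_steps,
            key=lambda g: jaccard(p_leaves, leaves(g, gold_steps)),
        )
    return mapping


steps = parse_proof(PROOF)
mapping = align(steps, steps)
print(leaves("int3", steps) == leaves("int6", steps))  # identical sets
print(mapping)
```

Because int3 and int6 are derived from the same two sentences, their sets are identical and both score 1.0 against either gold node. With the first-match tie-break assumed here, int6 is mapped to int3, which matches the mapping reported above. Any similarity computed only from these sets has no information with which to separate the two nodes.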
Hi,
Thanks for releasing the code! I have run it on prediction files and the results look correct. Just one minor issue that I'm curious about: I tried running the evaluation script on the ground truth validation proofs using
python eval/run_scorer.py --task "task_2" --split dev --prediction_file gts_val.tsv --output_dir ./ --bleurt_checkpoint bleurt-large-512
where gts_val.tsv contains the ground truth validation proofs. However, the result is not 100% for some metrics.
I haven't looked into the code in detail yet, but I guess the issue may be related to the alignment process or to the process for calculating BLEURT similarity: either some trees don't perfectly align with themselves, or some sentences have <0.28 BLEURT similarity with themselves. If you happen to have any thoughts, I would be most interested to hear them. Thank you!