Closed: horenbergerb closed this issue 1 year ago.
I have created a pastebin of llama_2hop.log in case you would like to try and replicate the issue: https://pastebin.com/7kAHbH5m
More data: these are all the values used to check whether each step in the example above is a wrong branch. Possibly an incorrect value here could explain why the steps aren't labeled as wrong branches, but I'm not yet able to interpret what these values mean.
Proof step: 0
not is_useful: True
proof_step in axioms: True
type(proof_step) == fol.FOLForAll: False
last_step != None: False
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: False
Proof step: 1
not is_useful: True
proof_step in axioms: True
type(proof_step) == fol.FOLForAll: True
type(proof_step.operand) == fol.FOLIfThen: True
type(proof_step.operand.antecedent) == fol.FOLFuncApplication: True
last_step != None: True
proof_step.operand.antecedent.function == last_step.function: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: True
Proof step: 2
not is_useful: True
proof_step in axioms: False
type(proof_step) == fol.FOLForAll: False
last_step != None: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: False
Proof step: 3
not is_useful: True
proof_step in axioms: False
type(proof_step) == fol.FOLForAll: True
type(proof_step.operand) == fol.FOLIfThen: True
type(proof_step.operand.antecedent) == fol.FOLFuncApplication: True
last_step != None: True
proof_step.operand.antecedent.function == last_step.function: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: True
Proof step: 4
not is_useful: False
proof_step in axioms: False
type(proof_step) == fol.FOLForAll: False
last_step != None: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: True
I guess what it all comes down to is that step 0 isn't marked as a wrong branch because:
1) It's the first step
2) It's not a ForAll statement
Is this intended behavior? Step 0 begins a line of reasoning that is tangential to the actual proof, and its end result is wrong. It seems to me that it should be classified as a "wrong branch." Or is "correct but not useful" truly distinct from "wrong branch"?
Steps 1 and onward can never be labeled as wrong branches, because step 0 was already not in the expected proof.
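For reference, here is a minimal Python sketch of the conjunction of checks printed in the breakdown above. The dataclasses are hypothetical stand-ins for the repo's `fol.FOLForAll` / `fol.FOLIfThen` / `fol.FOLFuncApplication` classes, and `is_wrong_branch` is my reconstruction from the debug output, not the actual code in `run_experiment.py` (I'm also assuming the "g(a) not in gold proof" part of the definition is what `not is_useful` captures):

```python
# Illustrative reconstruction only: these dataclasses stand in for the
# repo's fol classes, and is_wrong_branch() just conjoins the conditions
# printed in the debug output above.
from dataclasses import dataclass

@dataclass(frozen=True)
class FOLFuncApplication:
    function: str
    args: tuple = ()

@dataclass(frozen=True)
class FOLIfThen:
    antecedent: object
    consequent: object

@dataclass(frozen=True)
class FOLForAll:
    variable: str
    operand: object

def is_wrong_branch(proof_step, last_step, axioms, expected_proof, is_useful):
    # A step counts as a wrong branch only if every printed check passes:
    # it is a universally-quantified implication taken from the axioms,
    # its antecedent matches the previous step, and that previous step is
    # part of the expected (gold) proof.
    return (
        not is_useful
        and proof_step in axioms
        and type(proof_step) == FOLForAll
        and type(proof_step.operand) == FOLIfThen
        and type(proof_step.operand.antecedent) == FOLFuncApplication
        and last_step is not None
        and type(last_step) == FOLFuncApplication
        and proof_step.operand.antecedent.function == last_step.function
        and last_step in expected_proof
    )

# Toy example: step 0 is a ground fact, so it can never satisfy the
# FOLForAll / last_step checks, matching the breakdown above.
fact = FOLFuncApplication("cat", ("fluffy",))
distractor = FOLForAll("x", FOLIfThen(FOLFuncApplication("cat", ("x",)),
                                      FOLFuncApplication("striped", ("x",))))
axioms = [fact, distractor]
gold = [fact, FOLFuncApplication("mammal", ("fluffy",))]

print(is_wrong_branch(fact, None, axioms, gold, is_useful=False))        # False
print(is_wrong_branch(distractor, fact, axioms, gold, is_useful=False))  # True
```

Under this conjunction, step 0 fails the ForAll and last_step checks, and every later step fails `last_step in expected_proof` once the proof has left the gold path, which is exactly the pattern in the breakdown above.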
Sorry for the delayed response! Yes I think this is a bug that was introduced when we modified the proof evaluation function from v1 to v2 (from the original version to the "OOD" version of the dataset). We hadn't caught this since the experiments on v2 didn't rely on counting the number of misleading steps. I will take a look at this soon and fix it so that the behavior is the same as in v1.
I went through your example more carefully: It turns out this is not really a bug per se.
Yes, our definition of "misleading" was a bit narrower (we check for it here in the code). A misleading step is of the form for all x.(f(x) => g(x)), where the previous step is f(a) and is part of the gold proof, and g(a) is not part of the gold proof. But if you want to relax this definition, please feel free to modify this line. We only checked for universally-quantified formulas since the first version of the dataset focused on the modus ponens rule. We also didn't allow the first step to be misleading, since we thought the model needs to be "misled" from the correct path in the proof; if there are no previous steps, we can't tell whether the model is on the correct path.
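In symbols, my reading of that definition (with s_i the model's i-th proof step and G the gold proof) is:

```latex
\mathrm{misleading}(s_i) \iff
  s_i = \forall x\,\bigl(f(x) \Rightarrow g(x)\bigr)
  \;\wedge\; i > 0
  \;\wedge\; s_{i-1} = f(a)
  \;\wedge\; s_{i-1} \in G
  \;\wedge\; g(a) \notin G
```

which makes explicit why step 0 is excluded: the $i > 0$ and $s_{i-1}$ conditions have nothing to refer to on the first step.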
Thanks for the information! I was able to revert to v1 and run my experiments again, and the statistics make more intuitive sense now. Whether it's a bug or a feature in v2, I think it's a little misleading to classify the example above as '"invalid step" is the first mistake'. Maybe there ought to be a new category or something, since the first mistake was clearly step 0, which is a valid step.
Also, does v2 correspond to a different paper than the original "Language Models are Greedy Reasoners"? If so, it would be good to note that in the README. I'd like to see what new functionality has been added so I can make use of it!
Thanks for all your work on this project! It's a fantastic idea and my experience with this repo has otherwise been quite smooth.
That makes sense. A new category would be a good idea, since I agree it shouldn't be an invalid step.
And yes, the OOD (v2) version of the dataset corresponds to a more recent paper (https://arxiv.org/abs/2305.15269). We're finishing up the camera-ready of this right now actually, and once that's done, we'll update the arxiv and add it to the readme.
Thank you for your interest in our work! ☺️
Hello! I am running this repo on some LLaMa models, and I noticed a discrepancy. By visual inspection, I confirmed that most incorrect proofs begin with a useless branch, but running
python analyze_results.py llama_2hop.log
was saying otherwise. Let's look at a particular example. This is a Q/A from llama_2hop.log:
I added some print statements in run_experiment.parse_log at line 1421 to see how each proof is classified. Here is the breakdown for this proof:
Isn't this a mistake? Clearly the first statement in the proof is a useless branch. I am not sure what is causing this classification issue. Any ideas?
I added a custom API for querying LLaMa models, but I can't see any reason it would affect parsing in this way. Thanks!