asaparov / prontoqa

Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task.
Apache License 2.0
118 stars 13 forks source link

analyze_results.analyze_log producing incorrect statistics #4

Closed horenbergerb closed 1 year ago

horenbergerb commented 1 year ago

Hello! I am running this repo on some LLaMa models, and I noticed a discrepancy. By visual inspection, I confirmed that most incorrect proofs begin with a useless branch, but running python analyze_results.py llama_2hop.log was saying otherwise:

Proportion of incorrect proofs where the "useless branch" is the first mistake: 0.19047619047619047
Proportion of incorrect proofs where the "useful skip step" is the first mistake: 0.0
Proportion of incorrect proofs where the "useless skip step" is the first mistake: 0.0
Proportion of incorrect proofs where the "useful non-atomic step" is the first mistake: 0.0
Proportion of incorrect proofs where the "useless non-atomic step" is the first mistake: 0.0
Proportion of incorrect proofs where the "invalid step" is the first mistake: 0.8095238095238095

Let's look at a particular example. This is a Q/A from llama_2hop.log:

Q: Vumpuses are aggressive. Vumpuses are shumpuses. Each vumpus is a tumpus. Each shumpus is rainy. Each shumpus is a sterpus. Each shumpus is a lempus. Every sterpus is not bright. Sterpuses are zumpuses. Every sterpus is a numpus. Grimpuses are bright. Each numpus is wooden. Every lempus is not sour. Tumpuses are not nervous. Lorpuses are opaque. Lorpuses are yumpuses. Alex is a lorpus. Alex is a shumpus. True or false: Alex is bright.
A:

Predicted answer: Alex is a lorpus. Each lorpus is a yumpus. Alex is a yumpus. Yumpuses are opaque. Alex is opaque. False

Expected answer: Alex is a shumpus. Each shumpus is a sterpus. Alex is a sterpus. Every sterpus is not bright. Alex is not bright.
 False

I added some print statements in run_experiment.parse_log at line 1421 to see how each proof is classified. Here is the breakdown for this proof:

Alex is a lorpus. Each lorpus is a yumpus. Alex is a yumpus. Yumpuses are opaque. Alex is opaque. False

correct steps: [0, 1, 2] correct and useful steps: [] redundant steps: [] unparseable steps: [] wrong branch steps: [] invalid steps: [3]

Isn't this a mistake? Clearly the first statement in the proof is a useless branch. I am not sure what is causing this classification issue. Any ideas?

I added a custom API for querying LLaMa models, but I can't see any reason it would affect parsing in this way. Thanks!

horenbergerb commented 1 year ago

I have created a pastebin of llama_2hop.log in case you would like to try and replicate the issue: https://pastebin.com/7kAHbH5m

horenbergerb commented 1 year ago

More data: These are all the values used to check whether each step in the example above is a wrong branch. Possibly an incorrect value here could explain why the steps aren't labeled as wrong branches, but I'm not really able to interpret the meaning of these values yet.

Proof step: 0
not is_useful: True
proof_step in axioms: True
type(proof_step) == fol.FOLForAll: False
last_step != None: False
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: False

Proof step: 1
not is_useful: True
proof_step in axioms: True
type(proof_step) == fol.FOLForAll: True
type(proof_step.operand) == fol.FOLIfThen: True
type(proof_step.operand.antecedent) == fol.FOLFuncApplication: True
last_step != None: True
proof_step.operand.antecedent.function == last_step.function: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: True

Proof step: 2
not is_useful: True
proof_step in axioms: False
type(proof_step) == fol.FOLForAll: False
last_step != None: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: False

Proof step: 3
not is_useful: True
proof_step in axioms: False
type(proof_step) == fol.FOLForAll: True
type(proof_step.operand) == fol.FOLIfThen: True
type(proof_step.operand.antecedent) == fol.FOLFuncApplication: True
last_step != None: True
proof_step.operand.antecedent.function == last_step.function: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: True

Proof step: 4
not is_useful: False
proof_step in axioms: False
type(proof_step) == fol.FOLForAll: False
last_step != None: True
last_step in expected_proof: False
type(last_step) == fol.FOLFuncApplication: True
horenbergerb commented 1 year ago

I guess what it all comes down to is that step 0 isn't marked as a wrong branch because:

1) It's the first step 2) It's not a ForAll statement

Is this intended behavior? Step 0 heralds a sequence of reasoning which is tangential to the actual proof, and the end result is wrong. It seems to me that it should classify as a "wrong branch." Or is "correct and not useful" truly distinct from "wrong branch"?

Step 1 and onward can never be labeled wrong branch because step 0 was already not in the expected proof.

asaparov commented 1 year ago

Sorry for the delayed response! Yes I think this is a bug that was introduced when we modified the proof evaluation function from v1 to v2 (from the original version to the "OOD" version of the dataset). We hadn't caught this since the experiments on v2 didn't rely on counting the number of misleading steps. I will take a look at this soon and fix it so that the behavior is the same as in v1.

asaparov commented 1 year ago

I went through your example more carefully: It turns out this is not really a bug per se.

Yes, our definition of "misleading" was a bit narrower (we check for it here in the code). A misleading step is of the form for all x.(f(x) => g(x)) where the previous step is f(a) and is part of the gold proof, and g(a) is not part of the gold proof. But if you want to relax this definition, please feel free to modify this line. We only checked for universally-quantified formulas since the first version of the dataset focused on the modus ponens rule. We also didn't allow the first step to be misleading since we thought the model needs to be "misled" from the correct path in the proof, and if there are no previous steps, we can't tell if the model is on the correct path.

horenbergerb commented 1 year ago

Thanks for the information! I was able to revert to v1 and run my experiments again. The statistics make more intuitive sense now. Whether it's a bug or a feature in v2, I think it's a little misleading to classify the example above as "invalid step" is the first mistake. Maybe there ought to be a new category or something, since clearly first mistake was step 0, which is a valid step.

Also, does v2 correspond to a different paper than the original "Language Models are Greedy Reasoners"? It could be good to update that in the README. I'd like to see what kind of new functionality has been added so I can make use of it!

Thanks for all your work on this project! It's a fantastic idea and my experience with this repo has otherwise been quite smooth.

asaparov commented 1 year ago

That makes sense. A new category would be a good idea, since I agree it shouldn't be an invalid step.

And yes, the OOD (v2) version of the dataset corresponds to a more recent paper (https://arxiv.org/abs/2305.15269). We're finishing up the camera-ready of this right now actually, and once that's done, we'll update the arxiv and add it to the readme.

Thank you for your interest in our work! ☺️