asaparov / prontoqa

Synthetic question-answering dataset to formally analyze the chain-of-thought output of large language models on a reasoning task.
Apache License 2.0

Raising the difficulty even more? #3

Closed: alexconstant9108 closed this issue 10 months ago

alexconstant9108 commented 1 year ago

@asaparov Thank you for your great work on the dataset and generator! What is a good way to further increase the difficulty of the generated problems? Should I just increase the number of hops from 5 to, say, 20, or is there a better way?

Also, I am curious: is there any software system out there (LLM-based or not) that can handle all of the most difficult cases?

asaparov commented 1 year ago

No worries, and thank you for your interest! Yes, increasing the proof depth is an easy way to increase difficulty. Increasing the proof width is another, but models seem to be better at handling wider proofs than deeper ones. A further way to increase complexity is to compose multiple different rule types, which can be done using --deduction-rule=Composed. You can control the depth of these compositional examples using the same --min-hops and --max-hops flags, and you can set the minimum number of distinct deduction rule types per example using the flag --rule-types=4.

Though if you do try to generate larger examples, you will need to increase the number of available concept names (here) and adjectives (here). Otherwise, if you ask to generate a larger ontology and it fails due to insufficient concept names, it will try again repeatedly and go into an infinite loop.

To your second question: LLMs are not able to solve problems with very many hops; this dataset was designed to test exactly that. However, classical/symbolic methods would be able to solve them pretty easily: since the grammar of the examples is very regular and simple, it is easy to convert them into symbolic logical forms and then use classical theorem provers to solve them.
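For instance, here is a minimal sketch of that pipeline (illustrative only, not code from this repo): a toy two-sentence example in the same style, encoded in first-order logic and checked with the Z3 SMT solver (pip install z3-solver). The premises themselves are made up for the illustration.

```python
# Toy example: "Every shumpus is a gorpus. Fae is a shumpus. Prove: Fae is a gorpus."
from z3 import BoolSort, Const, DeclareSort, ForAll, Function, Implies, Not, Solver, unsat

Thing = DeclareSort('Thing')                        # the domain of individuals
shumpus = Function('shumpus', Thing, BoolSort())    # unary predicates for concepts
gorpus = Function('gorpus', Thing, BoolSort())
fae = Const('fae', Thing)
x = Const('x', Thing)

s = Solver()
s.add(ForAll([x], Implies(shumpus(x), gorpus(x))))  # "Every shumpus is a gorpus."
s.add(shumpus(fae))                                 # "Fae is a shumpus."
s.add(Not(gorpus(fae)))                             # negation of the query
print(s.check() == unsat)                           # unsat means the query is entailed
```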

alexconstant9108 commented 1 year ago

Hello again @asaparov :) I was kinda curious where GPT-3 fails at deduction and started looking at the examples in the file gpt_textdavinci003_4hop_OOD_Composed_random_noadj.log.

I noticed some logical contradictions in the first two examples and was curious whether this is a bug in the generator script or there is some other explanation. Could those contradictions in the premises have confused GPT-3 in its reasoning?

Test example 1:

Q: Numpuses are gorpuses. Fae is a lempus. Every shumpus is a gorpus. Fae is a gorpus. Every impus is a wumpus. Fae is a rompus. Each yumpus is a wumpus. Everything that is a lorpus and a gorpus and a lempus is a tumpus. Fae is a lorpus. Rompuses are sterpuses. Everything that is a lorpus, a zumpus, and a brimpus is a shumpus and a jompus and an impus. Everything that is a lorpus and a zumpus and a brimpus is a lempus. Jompuses are dumpuses. Every shumpus is a sterpus. Fae is a yumpus and Fae is a numpus and Fae is not a sterpus. Fae is a tumpus, a grimpus, and a lempus. Fae is a lorpus. Fae is a brimpus. Prove: Fae is not a zumpus.

------------ Analysis:
Fae is a rompus. Rompuses are sterpuses. => Fae is a sterpus.
Fae is a yumpus and Fae is a numpus and Fae is not a sterpus. => CONTRADICTION

Test example 2:

Q: Each grimpus is a numpus. Every yumpus is a jompus. Everything that is a lempus or a gorpus or a rompus is a zumpus. Everything that is a sterpus, a wumpus, or a gorpus is a lorpus. Every grimpus is a gorpus. Everything that is a sterpus or a wumpus or a gorpus is a numpus. Every grimpus is a jompus. Every lorpus is a zumpus. Lorpuses are grimpuses and yumpuses and sterpuses. Wren is a brimpus. Wren is a grimpus. Prove: Wren is a sterpus.

------------
Every grimpus is a gorpus. Everything that is a sterpus, a wumpus, or a gorpus is a lorpus. => Every grimpus is a lorpus.

Lorpuses are grimpuses and yumpuses and sterpuses. => Lorpuses are grimpuses  => CYCLE

What do you think?

asaparov commented 1 year ago

Oh, nice catch! Contradictions could have confused the LMs. Interestingly, though, in our results we had found that the gap between in-distribution and out-of-distribution compositional examples (i.e. "Composed") was not significant (see Fig. 6, min depth = 4); in fact, for PaLM and LLaMA, OOD performance was better than ID!

The cycles you found in test example 2 are not contradictions, and they do arise in real reasoning problems. In this case, you can prove that anything is a grimpus if and only if it is a gorpus, and so on. But it would be interesting to test whether reasoning performance differs significantly on problems with vs. without cycles.
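Concretely, the premises quoted in your analysis chain together into an equivalence (a sketch in first-order logic, using only statements from test example 2):

```latex
% From "Every grimpus is a gorpus.",
% "Everything that is a sterpus, a wumpus, or a gorpus is a lorpus.", and
% "Lorpuses are grimpuses and yumpuses and sterpuses.":
\[
\forall x\,(\mathrm{grimpus}(x) \rightarrow \mathrm{gorpus}(x)), \quad
\forall x\,(\mathrm{gorpus}(x) \rightarrow \mathrm{lorpus}(x)), \quad
\forall x\,(\mathrm{lorpus}(x) \rightarrow \mathrm{grimpus}(x))
\;\;\vdash\;\;
\forall x\,(\mathrm{grimpus}(x) \leftrightarrow \mathrm{gorpus}(x))
\]
```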

alexconstant9108 commented 1 year ago

Interesting... @asaparov I wanted to tweak the run_experiment script and generate a 10- or even 20-hop "monster" query to test the capabilities of GPT-4, and I would assume that even the most advanced ATPs have their limits and will struggle with such a big query. But first I need to make sure that there are no contradictions, at least of the kind in the first example. I also found another contradiction in a different example:

Q: Each jompus is a gorpus. Grimpuses are numpuses. Every shumpus is a lorpus, a vumpus, and a gorpus. Brimpuses are lempuses. Vumpuses are dumpuses. Tumpuses are grimpuses, sterpuses, and shumpuses. Rompuses are dumpuses. Tumpuses are numpuses. Sterpuses are zumpuses. Each shumpus is a jompus. Wren is a jompus. Every lorpus is a jompus. Wren is a numpus, a lempus, and a wumpus. Wren is a brimpus, Wren is not a gorpus, and Wren is a rompus. Prove: Wren is not a tumpus.

Here it follows that Wren is a gorpus, but then it is stated that Wren is not a gorpus.

asaparov commented 1 year ago

I fixed a bug that was causing these contradictory outputs. They would only appear in certain compositional proofs (i.e. with --deduction-rule Composed). Try with the new code and let me know if it still produces contradictions.

alexconstant9108 commented 1 year ago

Thanks for the fix!! :) I only did a quick test with 2 hops and just a couple of examples, and it seems like there are no contradictions so far. I'll generate some examples with more hops over the weekend to test the generator further.

Some notes:

With --ontology true, it results in an error: UnboundLocalError: local variable 'irrelevant_entity' referenced before assignment

I'd like to enable more deduction rules. What are the toggles for the AndIntro, AndElim, etc. rules, or do I need to tweak the run_experiment.py script directly?

asaparov commented 1 year ago

About the warnings: Hmm, 100 properties and 150 concepts is quite a lot, especially if you are only generating for 2 hops as in your example. Are the 100 properties grouped into families? If so, how many families are there? Could you perhaps send me your modified lists of properties and concept names so I can test it on my end and see what's going on?

On the error with --ontology true: I pushed a fix for this, but I noticed that the true ontology only works if --distractors and --test-distractors are both set to "none". We haven't implemented the new distractors for the true ontology. Would this functionality be important to your application?

On using other deduction rules: You should only need to set --deduction-rule <rule> where <rule> is any of ModusPonens, AndIntro, AndElim, OrIntro, OrElim, ProofByContra, Composed.

alexconstant9108 commented 1 year ago

Here is the modified .py script with the additional adjectives and the relevant and irrelevant concepts. I just wanted a high enough number of each for when I try to generate examples with more than 10 hops.

https://pastebin.com/6JpY2RQc

alexconstant9108 commented 1 year ago

After some more experimenting, it looks like ontology=false with up to 4 hops works fine. Command line:

hops=4 trials=1 python3 ./run_experiment.py --model-name json --model-size 1 --ordering random --deduction-rule ModusPonens --num-trials ${trials} --few-shot-examples 1 --ontology false --min-hops ${hops} --max-hops ${hops}

With 5 or more hops it gets into an infinite loop, printing lots of warnings: "WARNING: Could not extend ontology due to insufficient property families." and "WARNING: Could not extend ontology due to insufficient concept names." Should I generate even more than 150 concepts and adjectives? I thought those should be enough...

If I try 4 hops with any of the other deduction rules, I get one of the following:

-> raise Exception("formula_to_np_prime ERROR: Unsupported formula type ({}).".format(type(formula)))

or

"proof.py", line 407, in generate_de_morgans_question: assert(num_deduction_steps - 1 == 1) AssertionError

After the last fix, --ontology true seems to go into an infinite loop even with 1 hop, but I am fine with --ontology false, so this is not a problem...

asaparov commented 1 year ago

So the reason for this is that, in your code, you added the extra concept names to available_concept_names but only when ontology == "false". If you want to use --ontology fictional, you should also add them to available_concept_names on line 639.

Also, I noticed that while you added many properties, you still only have 12 property families. It's important to note that available_property_families is a list of lists (i.e. a list of property families, with each family containing many properties). It was designed this way to avoid generating things like "All wumpuses are hot. Wumpuses are dumpuses. All dumpuses are cold." So "hot" and "cold" are grouped into the same property family, and each family may only be used to describe sibling nodes in the ontology.
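For illustration, the expected structure of available_property_families is something like the sketch below (the family names and members are made-up examples, not the actual lists in run_experiment.py):

```python
# Illustrative sketch only. Each inner list is one property family; contrasting
# properties like "hot"/"cold" live in the same family, and a family is only used
# to describe sibling nodes in the ontology.
available_property_families = [
    ["hot", "cold", "lukewarm"],   # temperature family
    ["big", "small"],              # size family
    ["red", "blue", "green"],      # color family
    # ...add more families here to support deeper/wider ontologies
]
```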

After splitting each of your property families into 2, I had 24 property families, and this was sufficient to generate fictional and false examples with depths 5 and 10.

On the other deduction rules: We only implemented support for them with --proofs-only. That is, instead of asking the model to prove whether a query is true or false, with --proofs-only selected it will ask the model to output the proof of a given query, and this proof can be checked using analyze_results.py.

On the infinite loop with --ontology true: This is because you don't have --distractors none --test-distractors none (I mentioned above that we haven't implemented the new distractors for the true ontology).

I've added error messages for all of the above unsupported cases. But let me know if you really need any of these unimplemented features.

alexconstant9108 commented 1 year ago

Thanks for your patience and the helpful info :+1:

I've increased the number of properties to 150 families with 15 properties each, and I am finally clear of warnings at 10 hops :1st_place_medal:

Not so lucky with the insufficient concept names warning :( I went up to 500 concept names (copied them for the fictional ontology as well, although for now I only use ontology=false), but I still get some warnings, which is weird. I get the warnings even at 2 hops, but the file gets generated, so maybe they are false warnings?

The --proofs-only tip and the extra error messages really did the job for the rest :1st_place_medal:
Thanks!

asaparov commented 12 months ago

Thank you for your interest in this work and for your feedback!

Oh, it's fine to see some warnings. The generative process is random, so it's not uncommon to exhaust the concept/property names. You just want to make sure you have enough that it will sometimes generate valid ontologies. As long as the code terminates and produces correct output, you're good.

Hmm you probably want to use a fictional ontology if you want to fairly evaluate the reasoning ability of models. But I guess since you're replacing the concept names with fictional names anyway, the false ontology is equivalent to the fictional one.

alexconstant9108 commented 11 months ago

Hi again @asaparov. I am now playing with the generator script and the other possible deduction rules. ModusPonens and AndIntro seem to work fine, but AndElim, OrIntro, OrElim, ProofByContra, and Composed throw "Error: Unsupported formula type." I use the same command line as above and just change the --deduction-rule value. Is there another parameter that I also need to change?

asaparov commented 11 months ago

I suspect there's a mismatch between the requested proof width/depth and the deduction rule (e.g. AndElim requires width >= 2, etc.). What's the full command that produces that error?

alexconstant9108 commented 11 months ago

hops=15 trials=1 deductionRule=AndElim python3 ./run_experiment.py --proofs-only --model-name json --model-size 1 --ordering random --deduction-rule ${deductionRule} --num-trials ${trials} --few-shot-examples 1 --ontology false --min-hops ${hops} --max-hops ${hops}

1) Results in: raise Exception("formula_to_np_prime ERROR: Unsupported formula type ({}).".format(type(formula)))

2) The same exception with OrIntro.

3) OrElim and ProofByContra lead to: assert num_deduction_steps - 1 == 1

4) Composed leads to an infinite loop without any errors or warnings.

asaparov commented 11 months ago

We did not implement support for OrElim and ProofByContra with depth > 1, because any proof containing multiple sequential (i.e. nested) steps of disjunction elimination (proof by cases) or proof by contradiction would be very complicated, even for a human to reason about. It's not obvious to me how you would even construct such a proof without resorting to many other types of deduction rules (it would be difficult to control for deduction rule type). In our experiments, in order to increase the complexity of these examples, we instead increased the proof width.

For AndElim and OrIntro, we hadn't implemented support for generating sentences from certain kinds of logical forms when --no-adjectives was not specified. I pushed a change to add this support, but maybe using --no-adjectives would be better so that your examples are more comparable to those in our experiments.

For Composed, my guess is this is due to the rejection sampling method we use to generate compositional proofs. We first use a naive method to generate a proof and then check if it is consistent. If not, we try again. And I think the probability of generating an inconsistent proof goes up exponentially with # of hops, since our proof generation is very simple and naive. I suspect that we would need to make the proof generation algorithm more sophisticated to avoid this (i.e. avoid generating inconsistent axioms). Can you send me your modified run_experiment.py so I can test this?

alexconstant9108 commented 11 months ago

I see, that makes sense. Out of curiosity, I wanted to test the boundaries of both human and machine (LLM and ATP) deduction capabilities, because there has been a lot of debate lately about whether, for example, the current breed of LLMs can reason, which in turn raises the question of whether most humans can do it correctly either... A fun and interesting topic on its own...

> because any proof containing multiple sequential (i.e. nested) steps of disjunction elimination (proof by cases) or proof by contradiction would be very complicated, even for a human to reason about.

If you manage to implement such functionality in the generator (support for depth > 1 with OrElim / ProofByContra), then I will do my best to verify whether it generates correct proofs, and then test how the current SOTA LLMs and ATPs handle such queries.

> I pushed a change to add this support

Thanks I will test it a bit later.

> maybe using --no-adjectives would be better

I think adjectives help increase the difficulty, so I will keep them in.

Attached is the .py file I am using:

https://pastebin.com/JxwxHq5N

alexconstant9108 commented 11 months ago

> For AndElim and OrIntro, we hadn't implemented support for generating sentences from certain kinds of logical forms when --no-adjectives was not specified. I pushed a change to add this support,

With your latest patch I tested a newly generated 25hop_AndElim_1shot_random_falseontology.json, and it seems to work fine :1st_place_medal:

With any number of hops and OrIntro, I am getting:

`WARNING: Could not extend ontology due to insufficient concept names.`

`syntax.py(39)formula_to_np_prime() -> raise Exception("formula_to_np_prime ERROR: Unsupported formula type ({}).".format(type(formula)))`

asaparov commented 11 months ago

After testing with your code, it is indeed the case that the algorithm for generating compositional proofs is too naive to generate proofs with very large depth (the rejection rate increases exponentially with depth).
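As a back-of-the-envelope illustration of that blow-up (the per-hop consistency probability below is an assumed number, not something measured from the generator):

```python
# If each additional hop of a naively sampled proof stays consistent with probability p,
# an entire depth-d proof is accepted with probability ~p**d, so the expected number of
# rejection-sampling attempts grows exponentially with depth.
p = 0.9  # assumed per-hop consistency probability (illustrative only)
for depth in (2, 5, 10, 15, 20):
    accept = p ** depth
    print(f"depth={depth:2d}  P(accept)={accept:.3f}  expected attempts={1 / accept:.1f}")
```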

I'm sorry I don't currently have the time to implement generating OrElim, ProofByContra, or Composed proofs for large depth. If I do implement it in the future, I will let you know.

I found bugs related to converting logical forms into sentences when --no-adjectives is not specified. I pushed another fix for this.

Just a heads up: --no-adjectives doesn't remove information; it only changes how properties are specified. So, for example, the logical form for all x(big(x) & red(x) & dog(x) => mammal(x)) would be converted into something like "Everything that is big, red, and a dog is a mammal.", and into "Big red dogs are mammals." otherwise. So I'm not sure which is really more difficult.

Also, I do recommend you use the fictional ontology rather than false since your custom concept names are all fictional anyway, and it is more consistent with the naming we use.

alexconstant9108 commented 11 months ago

> I'm sorry I don't currently have the time to implement generating OrElim, ProofByContra, or Composed proofs for large depth. If I do implement it in the future, I will let you know.

I understand. It's likely not a trivial thing to implement. Just be sure to let me know if / when it is ready (it would likely be a publishable piece of work once implemented). Thanks again for your hard work! The generator is already amazing as is ))

"Everything that is big, red, and a dog is a mammal." "Big red dogs are mammals" otherwise. So I'm not sure which is really more difficult.

Both forms are equivalent. It's up to the NLP parser which one will be harder to parse. I guess "Big red dogs" would require more semantic understanding of what an adjective/noun is.

> Also, I do recommend you use the fictional ontology rather than false since your custom concept names are all fictional anyway, and it is more consistent with the naming we use.

Yeah, that makes sense. I just put the generated concepts in the first place I saw, which turned out not to be in tune with the naming of the ontologies.

asaparov commented 11 months ago

If you give me examples of OrElim and ProofByContra proofs with depth > 1, I can use them to think about the necessary changes to the generation algorithms for those cases.

alexconstant9108 commented 11 months ago

I am thinking about the ProofByContra rule. Depth > 1 is definitely not trivial. Keep in mind that I am good at solving these puzzles, not at generating them )) So for inspiration I was looking at some examples in flan-t5/latest/4hop_Composed_random_noadj.json and found the following:

"test_example": {"question": "Every lorpus is a brimpus. Everything that is a lempus, a wumpus, and a gorpus is a brimpus. Every gorpus is a rompus. Each impus is a wumpus. Lorpuses are not rompuses. Zumpuses are gorpuses. Max is a wumpus. Numpuses are gorpuses. Tumpuses are lempuses. Everything that is a wumpus and a grimpus and a sterpus is a gorpus. Every impus is a brimpus. Every shumpus is a sterpus. Vumpuses are lorpuses, tumpuses, and zumpuses. Everything that is a wumpus, a grimpus, and a sterpus is a dumpus. Each dumpus is a rompus. Every lorpus is a grimpus. Max is a lempus. Max is a gorpus. Impuses are lempuses. Each vumpus is a lempus. Dumpuses are brimpuses. Max is a numpus, a vumpus, and an impus. Max is a lempus and a gorpus and a yumpus. Max is an impus. Max is a shumpus.", "query": "Prove: Max is not a lorpus.", "chain_of_thought": ["Max is a numpus, a vumpus, and an impus.", "Max is a vumpus.", "Vumpuses are lorpuses, tumpuses, and zumpuses.", "Max is a lorpus and a tumpus and a zumpus.", "Max is a lorpus.", "Lorpuses are not rompuses.", "Max is not a rompus.", "Max is a shumpus.", "Every shumpus is a sterpus.", "Max is a sterpus.\n\n", "Assume Max is a lorpus.", "Every lorpus is a grimpus.", "Max is a grimpus.", "Max is an impus.", "Each impus is a wumpus.", "Max is a wumpus.", "Max is a wumpus and a grimpus and a sterpus.", "Everything that is a wumpus, a grimpus, and a sterpus is a dumpus.", "Max is a dumpus.", "Each dumpus is a rompus.", "Max is a rompus.", "This contradicts with Max is not a rompus.", "Max is not a lorpus.\n\n"]}

Here the chain of thought first proves "Max is a lorpus.", but then, when trying to generate a contradiction, it assumes the exact same thing to start with?! This must be a bug, right?

Anyway, looking at the above example... we have "Lorpuses are not rompuses.", which is translated as "Not every lorpus is a rompus" and NOT as "Every lorpus is NOT a rompus", correct? It is a bit hard to reason about because it's a fictional ontology, but imagine the following two statements: "Max is an animal. Animals are not birds." Does it follow that Max is NOT a bird? It could be a bird or it could not be one, so it does not follow. If we figure this one out, we may be able to generate something at depth 2, for example. What do you think?

asaparov commented 11 months ago

This was a result of the same bug that I had fixed before, right? If you rerun the script to generate that file, it won't output that example.

Actually, "Lorpuses are not rompuses" has the meaning for all x(lorpus(x) => ~rompus(x)) (i.e. each lorpus is not a rompus). But I agree, this was the same difficulty that we were running into when we tried to come up with examples of proof by contradiction that require 2 or more nested contradiction steps.