Installing the environments

Depending on what you want to reproduce, you may need to create several separate environments. To just run the symbolic solvers on the logic programs in the repo, install the Python packages from the requirements file, which is adapted from the original Logic-LM repository. The environment name can be changed, but note that the relevant bash scripts assume it is called `solver2`.

conda create --name solver2
conda activate solver2
pip install -r requirements.txt

To reproduce the prompting with Gemini, you need to create a conda env and install the package needed for Vertex AI. Because of version conflicts between dependencies, this has to be a separate env. The environment name can be changed, but note that the relevant bash scripts assume it is called `DL2`.

conda create --name DL2
conda activate DL2
pip install google-cloud-aiplatform
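
Because the bash scripts assume these environment names, it can help to confirm which environment is active before starting a run. The following is a minimal sketch, assuming conda's usual behaviour of exporting the active environment name in the CONDA_DEFAULT_ENV variable. The helper name require_conda_env is hypothetical and not part of the repository.

```python
import os


def require_conda_env(expected, environ=None):
    """Return the active conda env name, or raise if it is not `expected`."""
    # `environ` is injectable so the check can be exercised without conda.
    environ = os.environ if environ is None else environ
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != expected:
        raise RuntimeError(
            f"Expected conda env {expected!r} but found {active!r}; "
            f"run `conda activate {expected}` first."
        )
    return active
```

For example, a script could call require_conda_env("solver2") before running the solvers, or require_conda_env("DL2") before prompting Gemini.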

Multimodal data generation requires additional dependencies. You can find instructions on how to install them in the Multimodal section.

Setting up the LLMs

This project uses several Large Language Models (LLMs), namely GPT-4, Gemini and LLaMA. GPT-4 requires an API key to run the code. Gemini models are accessed through Vertex AI. Finally, LLaMA is accessed through Hugging Face, which requires a login.

Vertex AI for Gemini

To run the prompts using Vertex AI, you need to set up an account and a service account. See also the quickstart. Make sure to change the credentials to your own in the config files (baseline and Logic-LLM). With your environment activated, run gcloud auth login in the command line and follow the instructions in the pop-up to log in.

Azure for GPT-4

To use GPT-family models through Microsoft Azure, you need to create an account and get an API key. You can follow the official documentation on how to set up the environment and get the key. The key should be stored in the AZURE_API_KEY environment variable, which can be done by running the following command in the command line:

export AZURE_API_KEY="your_key_here"
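
A script can fail early with a clear message when the key is missing, instead of failing later inside an API call. The following is a minimal sketch of such a check. The function name get_azure_api_key is hypothetical, and treating the literal placeholder your_key_here as unset is an assumption, not something the repository is known to do.

```python
import os


def get_azure_api_key(environ=None):
    """Read AZURE_API_KEY from the environment and fail early if it is unset."""
    # `environ` is injectable so the check can be tested without real secrets.
    environ = os.environ if environ is None else environ
    key = environ.get("AZURE_API_KEY", "").strip()
    # Assumption: the placeholder from the export example counts as "not set".
    if not key or key == "your_key_here":
        raise RuntimeError("AZURE_API_KEY is not set; export it before running.")
    return key
```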

Snellius/own accelerator for Llama

Fill in the form on Hugging Face to get access to the Llama models and create a GitHub CLI token.

How to use

Reproducing Gemini results

The commands mentioned in the original repository are still available. Additionally, shell scripts are added in the Logic-LLM folder to support Gemini models. In each of the following bash files, uncomment the line that sets gemini_model to the model version of interest and comment out the others.



Note that our prompting results are already in the GitHub repo. To rerun these prompts (or to run them for other model versions), go to the baselines folder and run the corresponding bash file. This runs the prompts for all datasets from the paper in both the Direct and CoT modes.

cd Logic-LLM/baselines


To evaluate all Gemini baseline results, go (back) to the Logic-LLM folder and run:

conda deactivate
conda activate solver2
python3 ./baselines/evaluation_save.py

This will save the evaluation of the results in evaluation_baselines.json.
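
To inspect the saved file from Python, a small loader is enough. The following is a minimal sketch that makes no assumption about the structure of the JSON beyond it being valid JSON. The helper names load_evaluation and format_evaluation are hypothetical.

```python
import json
from pathlib import Path


def load_evaluation(path="evaluation_baselines.json"):
    """Parse the saved evaluation file and return its contents unchanged."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def format_evaluation(results):
    """Pretty-print whatever JSON structure the evaluation script produced."""
    return json.dumps(results, indent=2, sort_keys=True)
```

Calling print(format_evaluation(load_evaluation())) from the Logic-LLM folder would then show the stored results in a readable layout.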



Note that our prompting results are already in the GitHub repo. To rerun these prompts, go to the Logic-LLM folder and run the corresponding bash file. This runs the prompts for all datasets from the paper.


Running symbolic solvers

Note that the results of running the solvers on our own prompting results are already in the GitHub repo. To rerun the solvers, go to (or stay in) the Logic-LLM folder and run the corresponding bash file. By default, the backup strategy of using the CoT baseline answer from gemini-1.5-flash-preview-0514 is used. This runs the symbolic solvers for all datasets from the paper:



To evaluate all Gemini Logic-LM results, in the Logic-LLM folder run:

conda deactivate
conda activate solver2
python3 ./models/evaluation_save.py

This will save the evaluation of the results in evaluation_baselines.json.

Order bias


Note that our order bias prompting results are already in the GitHub repo. To rerun these prompts, go to the baselines folder and run the corresponding bash file. This runs the prompts for all datasets from the paper.



To evaluate our order bias experiments on the Gemini models, go (back) to the baselines folder and run:


This will print the accuracy for each case in which the correct option is always placed at the same position.


Data generation

The generated data for the multimodal experiments is in the project's data folder (Logic-LLM/data/). To generate new data, use our data generation tool in the multimodal_data_generator directory, which contains instructions on how to generate new data.

Reproducing Llama results

Python scripts are added in the scripts folder for testing various components, including the support of LLaMA models, using a single example as input. The following only needs the solver2 env.



Run the following command to generate Llama3 outputs for the baseline (set max_new_tokens to 16 for Direct and 1024 for CoT):

python3 baselines/lama_baseline.py --model_name "lama3" --dataset_name "FOLIO" --split dev --mode "Direct" --max_new_tokens "16"
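
Since the token budget depends on the mode, the pairing can be kept in one place when several runs are scripted. The following is a minimal sketch that only builds the argument list and does not run anything. The helper name baseline_args is hypothetical, and the script path is passed in by the caller rather than assumed.

```python
# Token budgets as stated above: 16 for Direct, 1024 for CoT.
MAX_NEW_TOKENS = {"Direct": 16, "CoT": 1024}


def baseline_args(script, dataset_name, mode, model_name="lama3", split="dev"):
    """Build the argument list for one baseline run with the matching budget."""
    if mode not in MAX_NEW_TOKENS:
        raise ValueError(f"mode must be one of {sorted(MAX_NEW_TOKENS)}")
    return [
        "python3", script,
        "--model_name", model_name,
        "--dataset_name", dataset_name,
        "--split", split,
        "--mode", mode,
        "--max_new_tokens", str(MAX_NEW_TOKENS[mode]),
    ]
```

The returned list can be handed to subprocess.run, or joined and printed to check the command before launching it.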


For evaluation, run:

python3 evaluate_llama.py \
 --dataset_name "Dataset Name [ProntoQA | ProofWriter | FOLIO | LogicalDeduction | AR-LSAT]" \
 --model_name "lama3" \
 --split dev \
 --mode "Baseline [Direct | CoT]"
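
The bracketed values in the command are placeholders listing the allowed choices, so one value has to be picked for each. The following is a minimal sketch that checks the choices before building the arguments for evaluate_llama.py. The helper name evaluation_args is hypothetical, and the allowed values are copied from the placeholders, not read from the script itself.

```python
# Allowed values as listed in the evaluation command's placeholders.
DATASETS = ("ProntoQA", "ProofWriter", "FOLIO", "LogicalDeduction", "AR-LSAT")
MODES = ("Direct", "CoT")


def evaluation_args(dataset_name, mode, model_name="lama3", split="dev"):
    """Validate the choices and build the evaluate_llama.py argument list."""
    if dataset_name not in DATASETS:
        raise ValueError(f"dataset_name must be one of {DATASETS}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    return [
        "python3", "evaluate_llama.py",
        "--dataset_name", dataset_name,
        "--model_name", model_name,
        "--split", split,
        "--mode", mode,
    ]
```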



python3 models/logic_program_lama.py --dataset_name "AR-LSAT" --split dev --model_name "lama3" --max_new_tokens 1024

Running symbolic solvers

For inference:

python3 models/logic_inference.py \
 --model_name lama3 \
 --dataset_name ${DATASET} \
 --split dev \
 --backup_strategy "random or LLM" \
 --backup_LLM_result_path ./baselines/results/CoT_${DATASET}_${SPLIT}_${MODEL}.json
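
The backup path combines the dataset, split and model names, so it is easy to mistype by hand. The following is a minimal sketch that composes it, assuming the file-name pattern CoT_<dataset>_<split>_<model>.json under ./baselines/results. The helper name backup_result_path is hypothetical.

```python
from pathlib import PurePosixPath

# The two values the backup strategy option accepts, per the command's text.
BACKUP_STRATEGIES = ("random", "LLM")


def backup_result_path(dataset, split, model, results_dir="./baselines/results"):
    """Compose the path of the CoT baseline results used as the LLM backup."""
    # Assumption: results are named CoT_<dataset>_<split>_<model>.json.
    return PurePosixPath(results_dir) / f"CoT_{dataset}_{split}_{model}.json"
```

The path only matters when the backup strategy is LLM. With the random strategy no baseline results file is needed.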


For evaluation, run:

python3 evaluate_llama.py --dataset_name "FOLIO" --model_name "gpt-4" --split dev --backup "random or LLM"