This is the official implementation of the paper "Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers".
Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.
pip install -r requirements.txt
./data/cls/{dataset_name}
. For BBH datasets, download them from the CoT-hub repo and put them in the folder BBH/data/{dataset_name}
.auth.yaml
We instantiate two evolutionary algorithms, GA (genetic algorithm) and DE (differential evolution), to evolve the initial population. Evolve your prompts using the following commands:
Customize the parameter --llm_type to use text-davinci-003, gpt-3.5-turbo, or gpt-4.
# understanding task on Alpaca
bash scripts/cls/run_ga_alpaca.sh # Genetic algorithm
bash scripts/cls/run_de_alpaca.sh # Differential evolution
# simplification task on Alpaca
bash scripts/sim/run_de_alpaca.sh
bash scripts/sim/run_ga_alpaca.sh
# summarization task on Alpaca
bash scripts/sum/run_de_alpaca.sh
bash scripts/sum/run_ga_alpaca.sh
# for BBH tasks
cd BBH
bash scripts/run_de_cot.sh # DE
bash scripts/run_ga_cot.sh # GA
To evaluate a single instruction, run one of the following scripts and set the argument --content to the prompt whose performance you want to evaluate:
bash scripts/cls/eval_single_alpaca.sh # understanding task on alpaca
bash scripts/sim/eval_single_alpaca.sh # simplification
bash scripts/sum/eval_single_alpaca.sh # summarization
# BBH
cd BBH
bash scripts/eval.sh # few-shot evaluation
Note that our framework uses two language models: one for evolution (argument --llm_type) and one for task implementation (argument --language_model).
The number of iterations and the population size affect the performance of EvoPrompt, so there is a trade-off between cost and performance. For relatively simple tasks, a population size of 10 and 10 iterations are enough, or even fewer. For complex tasks, a larger and more diverse population is required.
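To make the cost side of this trade-off concrete, here is a rough back-of-the-envelope sketch. It is not taken from the repository: it assumes that each iteration produces one candidate prompt per population member with one evolution-LLM call, and that every candidate is scored on every dev-set example with one task-LLM call. The function name and the dev-set size of 200 are hypothetical.

```python
def estimate_llm_calls(pop_size: int, iterations: int, dev_size: int) -> dict:
    """Rough call counts under the assumptions stated above (not measured)."""
    # Assumption: the initial population is scored once on the dev set.
    initial_scoring = pop_size * dev_size
    # Assumption: one evolution-LLM call per new candidate prompt.
    evolution_calls = pop_size * iterations
    # Assumption: each new candidate is scored on every dev example.
    candidate_scoring = pop_size * iterations * dev_size
    return {
        "evolution_calls": evolution_calls,
        "scoring_calls": initial_scoring + candidate_scoring,
    }


# The "simple task" setting from the text: population 10, 10 iterations.
budget = estimate_llm_calls(pop_size=10, iterations=10, dev_size=200)
print(budget)
```

Under these assumptions the scoring calls dominate, so the dev-set size matters as much as the population size and the number of iterations when estimating a budget.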
You may need to set the following arguments to customize your own configuration.
task: the task category, such as sim (simplification), cls (classification), or sum (summarization). If you need to extend this to other tasks, you may override the evaluation metric.
dataset: the dataset you want to evolve prompts on
dev_file: the path of the development set
language_model: the model used for task implementation
llm_type: the LLM used to evolve prompts
position: this argument mainly indicates whether to use demonstrations (zero-shot or few-shot)
sample_num: the size of the dev set, mainly used for generation tasks where there is no need to set dev_file
prompt_num: the number of examples for few-shot demonstrations

The EvoPrompt pipeline has three main steps, listed below. Each algorithm instantiates them slightly differently.
Initialization: We use manually written prompts or prompts generated by GPT as the initial population (see prompts.txt and prompts_auto.txt under the path of each dataset).
Evolution (mutation and crossover): For the templates used for DE and GA, see the files ./data/templates_ga and ./data/templates_de. We use a demonstration that includes one example of the algorithm implementation, so that the LLM follows the evolution steps and produces prompts in the expected form. To keep the LLM from copying the demonstration, the demonstration uses a different task from the one being optimized.
Evaluation and update: After each iteration, we need to select which prompts to keep in the population. For GA, we keep the top-$N$ prompts in each iteration, while for DE, we replace the old prompt if the newly generated one is better.
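The two update rules in the last step can be sketched as follows. This is a minimal illustration, not the code in evoluter.py: the function names and the (prompt, score) tuple layout are hypothetical, and pooling the old population with the new prompts before taking the top N is an assumption about how the GA update is applied.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (prompt text, dev-set score); hypothetical layout


def ga_update(population: List[Scored], children: List[Scored], n: int) -> List[Scored]:
    """GA-style update: keep the n highest-scoring prompts.

    Assumption: old prompts and new children compete in one pool.
    """
    pool = population + children
    return sorted(pool, key=lambda item: item[1], reverse=True)[:n]


def de_update(population: List[Scored], candidates: List[Scored]) -> List[Scored]:
    """DE-style update: candidate i replaces prompt i only if it scores higher."""
    return [
        cand if cand[1] > old[1] else old
        for old, cand in zip(population, candidates)
    ]


population = [("a", 0.6), ("b", 0.4), ("c", 0.5)]
new_prompts = [("a2", 0.7), ("b2", 0.3), ("c2", 0.5)]

print(ga_update(population, new_prompts, n=3))
print(de_update(population, new_prompts))
```

In this sketch the GA update can drop a weak prompt entirely in favor of a strong one from elsewhere in the pool, whereas the DE update compares each prompt only with its own candidate, so every slot in the population is retained or replaced independently.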
Set the argument sel_mode to apply a different selection strategy. There are three choices: ["wheel", "random", "tour"]; wheel is used by default.
Set the argument template to use different settings.
--template v2
--donor_random
--template v1 (default setting)
--template v3
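For the sel_mode choices mentioned earlier, the sketch below shows one common reading of the three names: "wheel" as roulette-wheel (score-proportional) selection, "random" as uniform sampling, and "tour" as tournament selection. This interpretation, the function name select_parents, and the tour_size parameter are assumptions for illustration and may differ from what the repository implements.

```python
import random
from typing import List, Optional, Tuple

Scored = Tuple[str, float]  # (prompt text, dev-set score); hypothetical layout


def select_parents(
    population: List[Scored],
    mode: str = "wheel",
    k: int = 2,
    tour_size: int = 2,
    rng: Optional[random.Random] = None,
) -> List[Scored]:
    """Pick k parent prompts according to the chosen selection mode."""
    rng = rng or random.Random()
    if mode == "wheel":
        # Roulette wheel: selection probability proportional to the score.
        weights = [max(score, 0.0) for _, score in population]
        if sum(weights) == 0:
            weights = [1.0] * len(population)  # fall back to uniform
        # Sampling is with replacement, so a parent may be drawn twice.
        return rng.choices(population, weights=weights, k=k)
    if mode == "random":
        # Uniform sampling without replacement, ignoring scores.
        return rng.sample(population, k)
    if mode == "tour":
        # Tournament: draw tour_size contenders, keep the best; repeat k times.
        return [
            max(rng.sample(population, tour_size), key=lambda item: item[1])
            for _ in range(k)
        ]
    raise ValueError(f"unknown selection mode: {mode}")


pop = [("a", 0.0), ("b", 1.0), ("c", 0.0)]
print(select_parents(pop, mode="wheel", rng=random.Random(0)))
```

With these scores, the wheel can only land on "b" because the other prompts have zero weight; with less extreme scores, weaker prompts still have a chance of being selected, which helps preserve diversity.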
For each prompt p, several donor prompts are used to generate the new prompt p'. If p' is better than p, p will be replaced by p'. Otherwise, it will be maintained.
.
├── args.py
├── auth.yaml
├── BBH  # code for BBH tasks
├── data  # dataset, templates used
│   ├── cls
│   ├── sim
│   ├── sum
│   ├── template_de.py  # templates of prompt evolution by DE
│   ├── template_ga.py  # templates of prompt evolution by GA
│   ├── template_v2.json  # templates for task implementation
│   └── templates.py  # wrapper
├── dataset.py  # dataset class
├── evaluator.py  # evaluators on different tasks
├── evoluter.py  # DE, GA, APE
├── evolution.py  # DE, GA, APE
├── get_result.py
├── infer.py  # main file for inference
├── llm_client.py  # LLM query
├── metrics.py  # metric calculation
├── requirements.txt
├── run.py  # main file for evolution
├── scripts  # scripts to run the code
└── utils.py  # auxiliary functions
If you find this repository helpful, please consider citing our paper:
@article{guo2023connecting,
title={Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers},
author={Guo, Qingyan and Wang, Rui and Guo, Junliang and Li, Bei and Song, Kaitao and Tan, Xu and Liu, Guoqing and Bian, Jiang and Yang, Yujiu},
journal={arXiv preprint arXiv:2309.08532},
year={2023}
}
Our codebase is based on the following repos. Thanks for open-sourcing!
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.