XinyuanWangCS / PromptAgent

This is the official repo for "PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization". PromptAgent is a novel automatic prompt optimization method that autonomously crafts prompts equivalent in quality to those handcrafted by experts, i.e., expert-level prompts.
https://arxiv.org/abs/2310.16427
Apache License 2.0

Some replication problems #3

Closed LuckyAnJooo closed 5 months ago

LuckyAnJooo commented 5 months ago

Hello, I appreciate your excellent work and have been trying to replicate your results on my computer. Here is the process I followed: first, I installed the necessary environment as per the "Installation" instructions. Then, I ran the main script using the command "python src/main.py --task_name bigbench...". Afterward, I selected the prompt from "best_reward_path_selected_node" in the "data.json" file for testing. I placed that prompt in the "prompts.txt" file and used the following command for testing:

python src/test.py --task_name bigbench --eval_prompt "Answer questions about a table of penguins and their attributes." --prompt_file "prompts.txt" --train_size 70 --eval_size 50 --test_size 79 --seed 42 --pred_model 'gpt-3.5-turbo' --data_dir "datasets/penguins_in_a_table.json" --api_key "..."
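
In case it matters, here is a minimal sketch of how I pull the prompt out of data.json (the exact layout may differ between runs; I'm assuming "best_reward_path_selected_node" holds either the prompt string itself or a dict with a "prompt" field, and the log path is just an example):

```python
import json

# Path to the MCTS output produced by main.py (illustrative; adjust to your run).
DATA_JSON = "train_log/bigbench_penguins/data.json"

with open(DATA_JSON) as f:
    data = json.load(f)

# Assumed layout: the selected node is either the prompt string itself
# or a dict containing a "prompt" field.
node = data["best_reward_path_selected_node"]
prompt = node["prompt"] if isinstance(node, dict) else node

# Write the prompt into the file passed to test.py via --prompt_file.
with open("prompts.txt", "w") as f:
    f.write(prompt)
```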

I conducted tests with both 'gpt-3.5-turbo' and 'palm2', yielding results of 0.7468 and 0.1266, respectively. Here are my questions:

  1. When running the main.py script, does palm2 lack the capability to serve as both the pred_model and the optim_model?
  2. As I understand from the paper, running main.py generates a series of prompts (paths/nodes in the MCTS tree), which are optimized to obtain the "expert prompt" output. In this context, test.py only needs to take any of these prompts for evaluation. Is my understanding correct?
  3. Is there a flaw in my replication process? The experimental results deviate from those presented in your paper: while the error with gpt-3.5-turbo is relatively small, there is a substantial difference in the results obtained with palm2. I would like to identify the specific step where the issue arises and would appreciate your insights.

I look forward to your response.

XinyuanWangCS commented 5 months ago

Hi, thanks for using our code.

  1. PaLM 2 indeed lacks the capability to serve as the pred_model or optim_model. Accuracy will drop when the prompt is transferred to PaLM 2, but I didn't notice such a dramatic drop in our experiments, since the human prompt reaches 0.43 according to the paper. There may be a couple of reasons: (1) a different PaLM 2 version, or PaLM 2 has been updated so its output no longer follows the answer extraction format. You may look into the output file of test.py and check what the responses look like (see the sketch after this list); since we mostly focus on OpenAI models, there could be an answer-format or extraction issue for PaLM 2. (2) There is randomness; it may simply be that PaLM 2 cannot understand this particular prompt. You may check our paper on OpenReview: https://openreview.net/pdf?id=22pyNMuIoa. There are more experiments on page 16, Section C.2, showing that PaLM 2 is a weaker model than GPT-3.5 when used as the base model in PromptAgent.

  2. Yes, you may take any of the optimized prompts to test. By default, the agent picks the path with the highest average reward.

  3. Same as question 1: as shown in the paper, PaLM 2 is not good at following complex instructions. There may also be randomness or a format issue. It has been 5 months since the experiment, so the model or API may have been updated while the PromptAgent code has not been updated accordingly. (Actually, not only PaLM: GPT-3.5 and GPT-4 are also being updated. You may use gpt-4-turbo to save some money :) )
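
To make the format check in point 1 concrete, here is a rough sketch of how you could scan the test.py outputs for PaLM 2 responses that a multiple-choice answer extraction would miss. The output file name, JSON layout, and the "(A)"-style option pattern are illustrative assumptions; adapt them to whatever the actual test.py output and extraction logic look like:

```python
import json
import re

# Illustrative output file and field names; adjust to the actual test.py output.
OUTPUT_FILE = "test_log/palm2_penguins_outputs.json"

with open(OUTPUT_FILE) as f:
    records = json.load(f)  # assumed: a list of {"question": ..., "response": ...}

# The penguins task is multiple choice, so one simple check is whether the
# response contains a parenthesized option letter such as "(A)".
option_pattern = re.compile(r"\(([A-E])\)")

unparsed = [r for r in records if not option_pattern.search(r.get("response", ""))]

print(f"{len(unparsed)} / {len(records)} responses have no extractable option letter")
for r in unparsed[:5]:
    print("---")
    print(r.get("response", ""))
```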

LuckyAnJooo commented 5 months ago

Thanks for your patient response.