Eladlev / AutoPrompt

A framework for prompt tuning using Intent-based Prompt Calibration
Apache License 2.0

There is a big difference between the optimized prompt and the initial text prompt #42

Closed qiulongquan closed 4 months ago

qiulongquan commented 4 months ago

I followed the documentation and used run_generation_pipeline.py to generate an optimized prompt. However, the optimized prompt is very different from the initial prompt, and many details are dropped. I have attached a screenshot of the original prompt (which is about parsing COBOL code and writing an analysis report), the optimized prompt (which omits many of the parsing details), and the related code.

Initial prompt: init_prompt

Output files: output.log, config_yaml.txt

Please tell me why this is happening and how I can improve it. Thank you.

Eladlev commented 4 months ago

A few remarks:

  1. In the GT ranking annotation prompt you wrote that the model "must strictly adhere to the directives provided in the initial text prompt". However, the model is not given these initial instructions. You should include them explicitly in the ranking prompt.
  2. Having said that, even if you add the guidelines to the prompt, you are not telling the model how to do the ranking (what counts as a '4' generation and what counts as a '5'). Therefore, when fitting to this large, detailed prompt, the model may be able to squeeze it into a much shorter or differently phrased prompt that gives equivalent results (close to 100% accuracy).
  3. Overall, fitting the generation prompt is split into two separate tasks:
    • Fitting the ranking prompt
    • Fitting the generation prompt

     If the score in both of them is high (0.9-1 in the ranking phase and 4.5-5 in the generation phase) and you are still not satisfied with the results, the issue is probably with the initial ranking annotator (the GT prompt). It's not easy to build a good GT ranking prompt...
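Remarks 1 and 2 can be combined into a ranking prompt that carries both the initial instructions and an explicit description of each score. The sketch below is only an illustration and is not part of AutoPrompt: `build_ranking_prompt`, `RUBRIC`, and the rubric wording are all hypothetical names and text that you would replace with your own criteria.

```python
from typing import Mapping

# Hypothetical rubric: the wording of each level is an assumption made for
# illustration. The point is that '4' and '5' are told apart explicitly.
RUBRIC = {
    1: "Ignores the task instructions or is unusable.",
    2: "Follows a few instructions; most required sections are missing.",
    3: "Follows the main instructions; several required details are missing.",
    4: "Follows all instructions; minor details are missing or imprecise.",
    5: "Follows all instructions; every required detail is present and correct.",
}


def build_ranking_prompt(initial_instructions: str, rubric: Mapping[int, str]) -> str:
    """Build a ranking prompt that embeds the task instructions and the rubric."""
    # One line per score, in ascending order, e.g. "4: Follows all instructions; ..."
    rubric_lines = "\n".join(
        f"{score}: {description}" for score, description in sorted(rubric.items())
    )
    return (
        "You are grading a model response against the task instructions below.\n\n"
        "### Task instructions\n"
        f"{initial_instructions.strip()}\n\n"
        "### Scoring rubric\n"
        f"{rubric_lines}\n\n"
        f"Reply with a single integer from {min(rubric)} to {max(rubric)}."
    )


if __name__ == "__main__":
    # Placeholder instructions; in practice, paste the full initial prompt here.
    print(build_ranking_prompt("Parse the COBOL program and write an analysis report.", RUBRIC))
```

With the instructions inlined, the ranker no longer has to guess what "the directives provided in the initial text prompt" are, and the rubric gives it a concrete basis for separating a 4 from a 5.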

Another small thing: I recommend using GPT-4 for the ranking predictor (not 3.5, as in your configuration), since this is a challenging task.