There is a big difference between the optimized prompt and the initial text prompt

A few remarks:

In the GT ranking annotation prompt you wrote that the model "must strictly adhere to the directives provided in the initial text prompt". However, the model isn't provided with these initial instructions. You should explicitly provide them in the ranking prompt.
Having said that, even if you add the guidelines to the ranking prompt, you are still not telling the model how to do the ranking (what a '4' generation is and what a '5' is). Therefore, when fitting to this very detailed prompt, the model may squeeze it into a much shorter or differently phrased prompt that gives equivalent results (close to 100% accuracy).
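To make the two points above concrete, here is a minimal sketch of a ranking prompt that carries both the initial instructions and an explicit per-score rubric. The function name `build_ranking_prompt`, its parameters, and the prompt wording are all hypothetical and are not part of AutoPrompt; the rubric descriptions are placeholders you would write yourself.

```python
def build_ranking_prompt(initial_instructions: str, rubric: dict[int, str], sample: str) -> str:
    """Assemble a ranking prompt (hypothetical helper, not an AutoPrompt API).

    The prompt embeds the original generation directives and says what each
    score means, so the ranker does not have to guess either of them.
    """
    scores = sorted(rubric)
    # One rubric line per score, in ascending order.
    rubric_lines = "\n".join(f"{score}: {rubric[score]}" for score in scores)
    return (
        "You are ranking a generated sample.\n\n"
        "The sample was generated under these instructions:\n"
        f"{initial_instructions}\n\n"
        "Scoring rubric:\n"
        f"{rubric_lines}\n\n"
        "Sample to rank:\n"
        f"{sample}\n\n"
        f"Answer with a single integer from {scores[0]} to {scores[-1]}."
    )


if __name__ == "__main__":
    # Placeholder instructions and rubric text, for illustration only.
    print(
        build_ranking_prompt(
            initial_instructions="<paste the initial text prompt here>",
            rubric={4: "<what a '4' generation looks like>", 5: "<what a '5' generation looks like>"},
            sample="<generated sample>",
        )
    )
```

The resulting string would be used as the GT ranking annotation prompt, so the annotator sees the same directives the generator was asked to follow.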
Overall, fitting the generation prompt is split into two separate tasks:
- Fitting the ranking prompt
- Fitting the generation prompt

If the score in both of them is high (0.9-1 in the ranking phase and 4.5-5 in the generation phase) and you are still not satisfied with the results, the issue is probably with the initial ranking annotator (the GT prompt). It's not easy to build a good GT ranking prompt...
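The diagnosis above can be written as a small decision rule. This is only a sketch: the function name `diagnose` and the returned messages are made up for illustration, and the thresholds are the lower ends of the ranges quoted above (0.9 for ranking, 4.5 for generation).

```python
def diagnose(ranking_accuracy: float, generation_score: float, satisfied: bool) -> str:
    """Suggest where to look next, given the scores of the two fitting phases.

    Hypothetical helper; thresholds are taken from the ranges mentioned in the text.
    """
    ranking_ok = ranking_accuracy >= 0.9    # ranking phase: 0.9-1 counts as high
    generation_ok = generation_score >= 4.5  # generation phase: 4.5-5 counts as high

    if satisfied:
        return "nothing to fix"
    if ranking_ok and generation_ok:
        # Both phases fit well, so the target itself is the likely problem.
        return "revisit the GT ranking prompt"
    if not ranking_ok:
        return "keep fitting the ranking prompt"
    return "keep fitting the generation prompt"
```

For example, `diagnose(0.95, 4.8, satisfied=False)` points back at the GT ranking prompt, which is the case described above.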

Another small thing: I recommend using GPT-4.5 for the ranking predictor (not 3.5 as in your configuration), since this is a challenging task.

Eladlev / AutoPrompt