slobentanzer opened 5 months ago
@drAbreu could you update briefly with your recent experiences?
A series of experiments was performed to investigate whether DSPy can improve the benchmarking results, specifically for the Llama family of models.
Unfortunately, it seems that it is currently not possible to use system prompts with the Llama models in DSPy. Some extra research has also shown me that the template we use in the system prompt for information extraction, while understood by OpenAI models, is not understood by other models. One example is Claude, where having the template
`FIGURE CAPTION: {{figure legend}} ##\n\n## QUERY: {{query}} ##\n\n## ANSWER FORMAT: {{format}}. Submit your answer EXTRICTLY in the format specified by {{format}}`
leads the model to essentially fail, while taking the template out leads to good results. This is likely because Claude uses XML-like tags for prompt templating, as opposed to GPT. This clearly points to prompt-engineering issues that will be model-dependent.
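For illustration, here is a minimal sketch contrasting the current "##"-delimited template with an XML-tagged variant of the kind Claude is documented to handle well; the field names are placeholders, not the exact ones used in BioChatter.

```python
# Sketch only: the "##"-delimited style mirrors the current template (understood
# by OpenAI models); the XML-tagged style follows Anthropic's recommendation to
# delimit prompt sections with XML-like tags. Field names are illustrative.

HASH_TEMPLATE = (
    "FIGURE CAPTION: {figure_legend} ##\n\n"
    "## QUERY: {query} ##\n\n"
    "## ANSWER FORMAT: {answer_format}. "
    "Submit your answer STRICTLY in the format specified above."
)

XML_TEMPLATE = (
    "<figure_caption>{figure_legend}</figure_caption>\n"
    "<query>{query}</query>\n"
    "<answer_format>{answer_format}</answer_format>\n"
    "Submit your answer strictly in the format given in <answer_format>."
)

# Example of rendering the Claude-friendly variant.
prompt_for_claude = XML_TEMPLATE.format(
    figure_legend="...", query="...", answer_format="..."
)
```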
This raises the question of whether our current benchmark for information_extraction is meaningful, since the issues of models other than GPT might arise from a lack of prompt understanding; we would then be measuring prompt comprehension rather than the models' actual capacity to extract the required information.
The idea of DSPy was to improve the prompt or the system prompt and thereby increase the quality of the LLM inferences. However, I do not see this happening for our information extraction.
I have been comparing GPT-3.5, GPT-4o and Claude 3.5 using the baseline API results and then several of the DSPy approaches.
As shown below, Claude 3.5 works better than any of the GPT models, with the surprise that gpt-4o is outperformed by GPT-3.5 :hug:
Also interesting: the most basic uses of DSPy (Signature and ChainOfThought) just make the models worse.
Few-shot learning is what actually provides the best results. The results are shown below; a minimal sketch of these DSPy setups follows the table.
| (ROUGE scores) | gpt-3.5-turbo | claude-3-opus-20240229 | gpt-4o |
| --- | --- | --- | --- |
| Baseline | 0.41 +/- 0.32 | 0.58 +/- 0.39 | 0.39 +/- 0.34 |
| DSPy Signature | 0.37 +/- 0.31 | 0.35 +/- 0.31 | 0.28 +/- 0.26 |
| DSPy ChainOfThought | 0.28 +/- 0.30 | 0.37 +/- 0.33 | 0.25 +/- 0.26 |
| DSPy LabeledFewShot | 0.48 +/- 0.37 | 0.66 +/- 0.34 | 0.44 +/- 0.30 |
| DSPy BootstrapFewShot | 0.47 +/- 0.35 | 0.58 +/- 0.40 | 0.43 +/- 0.26 |
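For reference, a minimal sketch of the DSPy setups compared in the table, assuming the dspy-ai package; the signature fields, the ROUGE metric, and the training examples are illustrative assumptions, not the exact ones used in the benchmark, and the LM configuration call depends on the DSPy version.

```python
import dspy
from dspy.teleprompt import LabeledFewShot, BootstrapFewShot
from rouge_score import rouge_scorer  # pip install rouge-score

# Configure the backend; class and call names depend on the DSPy version.
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))


class ExtractInfo(dspy.Signature):
    """Extract the queried information from a figure caption."""

    figure_caption = dspy.InputField(desc="the figure legend")
    query = dspy.InputField(desc="what to extract")
    answer = dspy.OutputField(desc="the extracted information, in the requested format")


class ExtractionModule(dspy.Module):
    def __init__(self):
        super().__init__()
        # Swap dspy.Predict for dspy.ChainOfThought to reproduce the CoT row.
        self.predict = dspy.Predict(ExtractInfo)

    def forward(self, figure_caption, query):
        return self.predict(figure_caption=figure_caption, query=query)


def rouge_metric(example, prediction, trace=None):
    """ROUGE-1 F1 between the gold and the predicted answer (illustrative)."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    return scorer.score(example.answer, prediction.answer)["rouge1"].fmeasure


# A handful of labelled examples (placeholder content).
trainset = [
    dspy.Example(
        figure_caption="Fig. 1: ...",
        query="Which cell types are shown?",
        answer="...",
    ).with_inputs("figure_caption", "query"),
]

# The two few-shot teleprompters from the table.
compiled_labeled = LabeledFewShot(k=4).compile(ExtractionModule(), trainset=trainset)
compiled_bootstrap = BootstrapFewShot(
    metric=rouge_metric, max_bootstrapped_demos=4
).compile(ExtractionModule(), trainset=trainset)
```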
Introducing the system prompt as a learnable parameter does not actually improve anything. With this few-shot learning process, the system prompt is not modified even slightly by the DSPy compiler.
The results do not change either.
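One way to check this (a sketch reusing the names from the snippet above; attribute access may differ between DSPy versions):

```python
# After few-shot compilation, the signature instructions (the system-prompt-like
# part) should be unchanged; only the list of demonstrations is populated.
print(compiled_labeled.predict.signature.instructions)
print(len(compiled_labeled.predict.demos), "demos attached")
```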
This experiment suggests that keeping track of the prompt engineering conventions of the different model families may be important to make the framework as universal as possible.
Very nice analysis, thanks! This aligns with my intuition that the model creators are doing many individualistic things, and it would thus be valuable to know the peculiarities of each model family and account for them in the backend to get comparable results between models. I'll be off next week, but let's catch up in September. :)
> the issues of models other than GPT might arise from a lack of prompt understanding

In fact, I did suspect that, but I think it is still valid to test, because this is the application we use. The next step would be the extraction module I suggested, where we look at each model family and create family-specific prompts to improve their performance. This would bump the BioChatter version, and we would hopefully see a positive trend in extraction performance for some of the models.
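A rough sketch of what such a family-specific extraction module could look like; the names (PROMPT_TEMPLATES, get_extraction_prompt) and the templates themselves are hypothetical, not existing BioChatter code.

```python
# Hypothetical mapping from model family to extraction prompt template.
# The Llama/Mistral entries would be filled in once the experiments
# with those families are done.
PROMPT_TEMPLATES = {
    "gpt": (
        "FIGURE CAPTION: {figure_legend} ##\n\n"
        "## QUERY: {query} ##\n\n"
        "## ANSWER FORMAT: {answer_format}"
    ),
    "claude": (
        "<figure_caption>{figure_legend}</figure_caption>\n"
        "<query>{query}</query>\n"
        "<answer_format>{answer_format}</answer_format>"
    ),
    "llama": None,  # to be determined experimentally
}

DEFAULT_FAMILY = "gpt"


def get_extraction_prompt(model_name: str) -> str:
    """Return the family-specific extraction template for a given model name."""
    for family, template in PROMPT_TEMPLATES.items():
        if family in model_name.lower() and template is not None:
            return template
    return PROMPT_TEMPLATES[DEFAULT_FAMILY]
```

With such a mapping, get_extraction_prompt("claude-3-opus-20240229") would return the XML-tagged variant, while unknown model names fall back to the current "##"-delimited default.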
> the most basic uses of DSPy (Signature and ChainOfThought) just make the models worse

That is very interesting and counterintuitive, although I am not surprised.
> with the surprise that gpt-4o is outperformed by GPT-3.5

We see this in many instances. My guess is that it has to do with the internal system instructions.
There remain some questions about the right prompt for the behaviour of the different models; Llama-series models seem to handle prompts differently than GPT. As an initial experiment, DSPy will be used to generate optimised text-extraction prompts for a selection of models (GPT, Llama, Mi(s/x)tral), which will then be examined for their differences.
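A possible setup for that experiment (a sketch that reuses ExtractionModule, rouge_metric, and trainset from the earlier snippet; the backend classes and their names are assumptions that vary with the installed DSPy version):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# One client per model family; dspy.OllamaLocal assumes the Llama/Mixtral
# models are served by a local Ollama instance.
backends = {
    "gpt-3.5-turbo": dspy.OpenAI(model="gpt-3.5-turbo"),
    "llama3": dspy.OllamaLocal(model="llama3"),
    "mixtral": dspy.OllamaLocal(model="mixtral"),
}

compiled_programs = {}
for name, lm in backends.items():
    with dspy.settings.context(lm=lm):
        teleprompter = BootstrapFewShot(metric=rouge_metric, max_bootstrapped_demos=4)
        compiled_programs[name] = teleprompter.compile(
            ExtractionModule(), trainset=trainset
        )

# Inspect what each backend ends up being prompted with.
for name, program in compiled_programs.items():
    print(name)
    print(program.predict.signature.instructions)
    print(len(program.predict.demos), "demos")
```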