Closed drAbreu closed 4 months ago
Amazing work, thanks @drAbreu! Will review ASAP and also run on the other models to update the living benchmark.
LLaMA3 actually has the same problem; it is also scoring close to 0.
Interestingly, openhermes again performs well (on par with GPT3.5). Maybe we don't need to change anything for the first benchmark, and then benchmark again with a later BioChatter version that has a more bespoke process for text extraction with open source models (to see some progress).
Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated; at least to me that is surprising. Performance seems almost binary; scores are either close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.
Indeed, this is something I also saw happening with GPT-4, and it surprised me. I will take a look at the edited behaviour to cross-check that it is as expected.
Hi @drAbreu, I had to move the conversation reset point in edfaa97 to prevent the messages in the conversation from accumulating. As it was, it just appended each task in the two loops in your test function to the conversation without removing previous messages, which led to overflow in the 4096 token context when I tried to run the benchmark on LLaMA2. Could you please check if the current code still captures the benchmarking behaviour you would like to see?
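For illustration, a minimal sketch of the reset pattern described above (the `Conversation` class here is hypothetical, not the actual BioChatter API):

```python
# Hypothetical sketch: reset the conversation before each benchmark task so
# messages from previous tasks do not accumulate and overflow the context
# window. The Conversation class is illustrative, not the BioChatter API.
class Conversation:
    def __init__(self, system_message: str):
        self.system_message = system_message
        self.messages: list[dict] = []

    def reset(self) -> None:
        # Drop all task messages; only the system message is kept.
        self.messages = []

    def append_user_message(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})


conv = Conversation("Extract entities from the figure legend.")
for task in ["legend 1", "legend 2", "legend 3"]:
    conv.reset()  # without this, messages grow with every task
    conv.append_user_message(task)
    assert len(conv.messages) == 1  # only the current task is in context
```

The point is simply that the reset happens per task, so each query is sent against a fresh context rather than the accumulated history of all previous tasks.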
Side note: LLaMA2 performs poorly on the benchmark because it does not seem to understand the instruction, it responds with "Sure, happy to help, please give me the legend." Maybe an update to the benchmark process is merited (or maybe a dedicated module in BioChatter, as we discussed), but this is tangential. We should first make sure that the benchmark process itself runs as we expect.
I can confirm the current code is doing what is expected. It was careless of me not to realise it before; there were several hints, such as the time the models needed to run and the OpenAI token charges. Now it makes full sense.
This change is good and working as expected.
> Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated, at least to me that is surprising. Performance seems almost binary; scores are close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.
Hi Sebastian, I will be on vacation for the next 10 days and will likely not take a look at this until I am back.
I fully agree that this should be tested.
As a first approach to the problem, I took a sample of results and scores from OpenHermes to search for a possible bug in the scoring procedure. I show some results below. The scores come out as expected, so in that sense the scoring procedure seems to behave correctly.
When I am back at work, or whenever I have a moment, I will do another series of checks to confirm this. I will compare the answers for some of the questions across the three different prompts to see whether there is any systematic behaviour that explains the similar scores.
EXPECTED: PINK1, FBXO7 RESPONSE: PINK1-/-, FBXO7-/- SCORE: 0.8
EXPECTED: cell_type: None organism: None tissue: None cell_line: HeLa subcellular: nucleus, mitochondria
RESPONSE: cell_type: None organism: None tissue: None cell_line: HeLa, PINK1-/-, FBXO7-/- subcellular: mitochondria, nuclear DNA (DAPI), pUb
SCORE: 0.7499999999999999
EXPECTED: 3D structured illumination microscopy (3D-SIM)
RESPONSE: ANSWER: 3D-SIM, AO-induced mitophagy
SCORE: 0.3333333333333333
EXPECTED: Yes RESPONSE: Yes SCORE: 1.0
EXPECTED: 0.0001 RESPONSE: *p(****)<0.0001 SCORE: 0.6666666666666666
EXPECTED: DAPI RESPONSE: AO, DAPI, HSP60 SCORE: 0.4
EXPECTED: Mean Acidic:Neutral mtKeima per-cell ratios, HeLa, mitochondria
RESPONSE: Mean Acidic:Neutral mtKeima per-cell ratios
SCORE: 0.823529411764706
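For what it's worth, some of these scores are consistent with a simple character-level similarity ratio. A sketch using Python's `difflib` (this is an assumption for illustration, not necessarily the metric the benchmark actually uses):

```python
from difflib import SequenceMatcher


def similarity(expected: str, response: str) -> float:
    """Character-level similarity ratio between expected and response text."""
    return SequenceMatcher(None, expected, response).ratio()


# Consistent with the first example above: the common blocks "PINK1" and
# ", FBXO7" cover 12 characters, giving 2 * 12 / (12 + 18) = 0.8.
print(similarity("PINK1, FBXO7", "PINK1-/-, FBXO7-/-"))  # 0.8
print(similarity("Yes", "Yes"))  # 1.0
```

If the real scorer is something along these lines, partially matching responses (extra tokens, genotype suffixes) would naturally produce the intermediate scores seen above.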
I made the following changes:
Hi Sebastian,
I have pulled your latest changes to the code and am now back to keep working on this. Is there any part where you want me to pay special attention?
@drAbreu welcome back; I guess most important at the moment is benchmark analysis and checking whether the tests actually are good indicators of performance. If you don't find any grave issues with that, I will merge after I have finished running the open source models.
Great :)
So, I have been playing around with Llama. It looks like there is some subtle behavioural issue related to the user prompt.
It needs a suffix prompt along the lines of "## ANSWER:" to work.
I leave here an example of the user prompt as it is in the code:
And here is an example of adding "## ANSWER: " at the end of the user prompt:
So this might be the answer to the issues seen in the Llama models.
For the rest, I have been checking the outputs of the models and the examples and everything seems to be in order.
I think the course of action could be to add "## ANSWER:" as a suffix for the user prompt when the Llama models are used.
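The proposed fix could look something like the following sketch, which conditionally appends the suffix only for Llama-family models (the function and model names are illustrative, not the actual BioChatter API):

```python
# Hypothetical sketch: append a completion-style "## ANSWER:" suffix to the
# user prompt for Llama-family models only, leaving other models untouched.
def build_user_prompt(prompt: str, model_name: str) -> str:
    if "llama" in model_name.lower():
        return prompt.rstrip() + "\n## ANSWER:"
    return prompt


print(build_user_prompt("## CAPTION\nFig. 1 ...", "llama-2-13b-chat"))
print(build_user_prompt("## CAPTION\nFig. 1 ...", "gpt-3.5-turbo"))
```

Keying the suffix on the model name keeps the benchmark prompts identical for models that do not need the completion-style nudge.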
I am thinking that by using this "## CAPTION", "## ANSWER FORMAT" etc. syntax, we may push the model towards a more completion-like behaviour. I am considering two things for next developments:
As previously agreed in a series of conversations, this adds a set of test cases for benchmarking LLMs on the task of text extraction, especially the extraction of data for molecular and cell biology.
The tests were added to the test_text_extraction.py module in benchmarking.
Three small examples are used here.
The code has been tested on a DGX machine for both OpenAI and Xinference, with successful results. We could provide further test cases for larger-scale benchmarking if needed; generating them would take some time.