Closed drAbreu closed 4 months ago
Amazing work, thanks @drAbreu! Will review ASAP and also run on the other models to update the living benchmark.
LLaMA3 actually has the same problem; it is also scoring close to 0.
Interestingly, openhermes again performs well (on par with GPT3.5). Maybe we don't need to change anything for the first benchmark, and then benchmark again with a later BioChatter version that has a more bespoke process for text extraction with open source models (to see some progress).
Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated; at least to me that is surprising. Performance seems almost binary; scores are either close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.
Indeed, this is something I also saw happening with GPT-4, and it surprised me. I will take a look at the edited behaviour to cross-check that it is as expected.
Hi @drAbreu, I had to move the conversation reset point in edfaa97 to prevent the messages in the conversation from accumulating. As it was, it just appended each task in the two loops in your test function to the conversation without removing previous messages, which led to overflow in the 4096 token context when I tried to run the benchmark on LLaMA2. Could you please check if the current code still captures the benchmarking behaviour you would like to see?
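For illustration, a minimal sketch of the reset pattern described above (the `Conversation` class here is hypothetical, not the actual BioChatter API):

```python
# Hypothetical sketch: reset the conversation before each benchmark task so
# messages from previous tasks do not accumulate and overflow the context
# window. The Conversation class is illustrative, not the BioChatter API.
class Conversation:
    def __init__(self, system_message: str):
        self.system_message = system_message
        self.messages: list[dict] = []

    def reset(self) -> None:
        # Drop all task messages; only the system message is kept.
        self.messages = []

    def append_user_message(self, content: str) -> None:
        self.messages.append({"role": "user", "content": content})


conv = Conversation("Extract entities from the figure legend.")
for task in ["legend 1", "legend 2", "legend 3"]:
    conv.reset()  # without this, messages grow with every task
    conv.append_user_message(task)
    assert len(conv.messages) == 1  # only the current task is in context
```

The point is simply that the reset happens per task, so each query is sent against a fresh context rather than the accumulated history of all previous tasks.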
Side note: LLaMA2 performs poorly on the benchmark because it does not seem to understand the instruction, it responds with "Sure, happy to help, please give me the legend." Maybe an update to the benchmark process is merited (or maybe a dedicated module in BioChatter, as we discussed), but this is tangential. We should first make sure that the benchmark process itself runs as we expect.
I can confirm the current code is doing what is expected. It was careless of me not to realise it before; there were several hints, such as the time the models needed to run and the OpenAI token charges. Now it makes full sense.
This change is good and working as expected.
> Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated, at least to me that is surprising. Performance seems almost binary; scores are close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.
Hi Sebastian, I will be on vacation for the next 10 days and will likely not take a look at this until I am back.
I fully agree that this should be tested.
As a first approach to the problem, I took a sample of results and scores from OpenHermes to search for a possible bug in the scoring procedure. I show some results below. The scores come out as expected, so in that sense the scoring procedure seems to behave correctly.
When I am back at work, or whenever I have a moment, I will do another series of checks to confirm this. I will compare the answers for some of the questions across the three different prompts to see whether there is any systematic behaviour that explains the similar scores.
EXPECTED: PINK1, FBXO7 RESPONSE: PINK1-/-, FBXO7-/- SCORE: 0.8
EXPECTED: cell_type: None organism: None tissue: None cell_line: HeLa subcellular: nucleus, mitochondria
RESPONSE: cell_type: None organism: None tissue: None cell_line: HeLa, PINK1-/-, FBXO7-/- subcellular: mitochondria, nuclear DNA (DAPI), pUb
SCORE: 0.7499999999999999
EXPECTED: 3D structured illumination microscopy (3D-SIM)
RESPONSE: ANSWER: 3D-SIM, AO-induced mitophagy
SCORE: 0.3333333333333333
EXPECTED: Yes RESPONSE: Yes SCORE: 1.0
EXPECTED: 0.0001 RESPONSE: *p(****)<0.0001 SCORE: 0.6666666666666666
EXPECTED: DAPI RESPONSE: AO, DAPI, HSP60 SCORE: 0.4
EXPECTED: Mean Acidic:Neutral mtKeima per-cell ratios, HeLa, mitochondria
RESPONSE: Mean Acidic:Neutral mtKeima per-cell ratios
SCORE: 0.823529411764706
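For what it's worth, some of these scores are consistent with a simple character-level similarity ratio. A sketch using Python's `difflib` (this is an assumption for illustration, not necessarily the metric the benchmark actually uses):

```python
from difflib import SequenceMatcher


def similarity(expected: str, response: str) -> float:
    """Character-level similarity ratio between expected and response text."""
    return SequenceMatcher(None, expected, response).ratio()


# Consistent with the first example above: the common blocks "PINK1" and
# ", FBXO7" cover 12 characters, giving 2 * 12 / (12 + 18) = 0.8.
print(similarity("PINK1, FBXO7", "PINK1-/-, FBXO7-/-"))  # 0.8
print(similarity("Yes", "Yes"))  # 1.0
```

If the real scorer is something along these lines, partially matching responses (extra tokens, genotype suffixes) would naturally produce the intermediate scores seen above.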
I made the following changes:
Hi Sebastian,
I have pulled your latest changes to the code and am now back to keep working on this. Is there any part where you want me to pay special attention?
@drAbreu welcome back; I guess most important at the moment is benchmark analysis and checking whether the tests actually are good indicators of performance. If you don't find any grave issues with that, I will merge after I have finished running the open source models.
Great :)
So, I have been playing around with Llama. It looks like there is some subtle behavioural issue related to the user prompt.
It needs a suffix prompt along the lines of "## ANSWER:" to work.
I leave here an example of the user prompt as it is in the code:
And here is an example of adding "## ANSWER: " at the end of the user prompt:
So this might be the answer to the issues seen in the Llama models.
For the rest, I have been checking the outputs of the models and the examples and everything seems to be in order.
I think the course of action could be to add "## ANSWER:" as a suffix for the user prompt when the Llama models are used.
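The proposed fix could look something like the following sketch, which conditionally appends the suffix only for Llama-family models (the function and model names are illustrative, not the actual BioChatter API):

```python
# Hypothetical sketch: append a completion-style "## ANSWER:" suffix to the
# user prompt for Llama-family models only, leaving other models untouched.
def build_user_prompt(prompt: str, model_name: str) -> str:
    if "llama" in model_name.lower():
        return prompt.rstrip() + "\n## ANSWER:"
    return prompt


print(build_user_prompt("## CAPTION\nFig. 1 ...", "llama-2-13b-chat"))
print(build_user_prompt("## CAPTION\nFig. 1 ...", "gpt-3.5-turbo"))
```

Keying the suffix on the model name keeps the benchmark prompts identical for models that do not need the completion-style nudge.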
I am thinking that by using this "## CAPTION", "## ANSWER FORMAT" etc. syntax, we may push the model towards a more completion-like behaviour. I am considering two things for next developments:
As previously agreed in a series of conversations, this adds a set of test cases for benchmarking LLMs on the task of text extraction, especially the extraction of data for molecular and cell biology.
The tests were added to the test_text_extraction.py module in benchmarking.
Three small examples are used here.
The code has been tested on a DGX machine for both OpenAI and Xinference, with successful results. We could provide further test cases for larger-scale benchmarking if needed; generating them would take some time.