biocypher / biochatter

Backend library for conversational AI in biomedicine
http://biochatter.org/
MIT License

LLM Benchmarking SourceData / Text extraction #146

Closed drAbreu closed 4 months ago

drAbreu commented 5 months ago

As previously agreed in a series of conversations, I have added a set of test cases for benchmarking LLM models on the task of text extraction, especially the extraction of data for molecular and cell biology.

The tests were added to the test_text_extraction.py module in benchmarking.

Three small examples are used here.

The code has been tested on a DGX machine for both OpenAI and Xinference, with successful results. We could provide further test cases for larger-scale benchmarking if needed, although generating them would take some time.
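For orientation, here is a minimal sketch of what such a parametrized extraction test could look like. Everything below (the case data, the `query_model` stand-in, and the recall-style score) is purely illustrative and does not reflect the actual BioChatter benchmark API.

```python
# Illustrative sketch only: the model call and scoring are placeholders,
# not the actual BioChatter benchmark code.
import pytest

CASES = [
    # (system-message variant, figure legend, question, expected answer tokens)
    ("simple", "HeLa PINK1-/- FBXO7-/- cells stained with DAPI.", "Which genes are knocked out?", {"PINK1", "FBXO7"}),
    ("detailed", "HeLa PINK1-/- FBXO7-/- cells stained with DAPI.", "Which genes are knocked out?", {"PINK1", "FBXO7"}),
    ("few_shot", "HeLa PINK1-/- FBXO7-/- cells stained with DAPI.", "Which genes are knocked out?", {"PINK1", "FBXO7"}),
]


def query_model(system: str, user: str) -> str:
    """Stand-in for a real model call (OpenAI, Xinference, ...)."""
    return "PINK1, FBXO7"


@pytest.mark.parametrize("variant,legend,question,expected", CASES)
def test_text_extraction(variant, legend, question, expected):
    # One fresh call per case; the response is split into answer tokens.
    response = query_model(system=variant, user=f"{legend}\n{question}")
    predicted = {t.strip() for t in response.split(",") if t.strip()}
    recall = len(predicted & expected) / len(expected)  # simple recall-style score
    assert 0.0 <= recall <= 1.0
```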

slobentanzer commented 5 months ago

Amazing work, thanks @drAbreu! Will review ASAP and also run on the other models to update the living benchmark.

slobentanzer commented 5 months ago

LLaMA3 actually has the same problem; it is also scoring close to 0.

slobentanzer commented 5 months ago

Interestingly, openhermes again performs well (on par with GPT3.5). Maybe we don't need to change anything for the first benchmark, and then benchmark again with a later BioChatter version that has a more bespoke process for text extraction with open source models (to see some progress).

slobentanzer commented 5 months ago

Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated; at least to me, it is surprising. Performance seems almost binary; scores are close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.

drAbreu commented 5 months ago

Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated; at least to me, it is surprising. Performance seems almost binary; scores are close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.

Indeed, this is something I also saw happening with GPT-4, and it surprised me. I will take a look at the edited behavior to cross-check that it is as expected.

drAbreu commented 5 months ago

Hi @drAbreu, I had to move the conversation reset point in edfaa97 to prevent the messages in the conversation from accumulating. As it was, it just appended each task in the two loops in your test function to the conversation without removing previous messages, which led to overflow in the 4096 token context when I tried to run the benchmark on LLaMA2. Could you please check if the current code still captures the benchmarking behaviour you would like to see?

Side note: LLaMA2 performs poorly on the benchmark because it does not seem to understand the instruction; it responds with "Sure, happy to help, please give me the legend." Maybe an update to the benchmark process is merited (or maybe a dedicated module in BioChatter, as we discussed), but this is tangential. We should first make sure that the benchmark process itself runs as we expect.

I can confirm the current code is doing what is expected. It was an oversight on my side not to realize this before; indeed there were several hints, such as the time the models needed to run and the OpenAI token charges. Now it makes full sense.

This change is good and works as expected.
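For reference, a minimal sketch of the reset pattern described above, with the reset moved inside the inner loop. This is illustrative only; the real benchmark uses BioChatter conversation objects, and the `reset` call shown here is an assumed placeholder for whatever the actual reset mechanism is.

```python
# Illustrative sketch of resetting the conversation per task so that previous
# messages do not accumulate and overflow the context window.
def run_benchmark(conversation, system_messages, tasks):
    scores = {}
    for sys_msg in system_messages:
        for name, prompt, score_fn in tasks:
            # Reset inside the inner loop: without this, each task's messages
            # stay in the conversation and eventually overflow the context
            # (e.g. 4096 tokens for LLaMA2).
            conversation.reset()  # assumed reset method, placeholder only
            conversation.append_system_message(sys_msg)
            response = conversation.query(prompt)
            scores[(sys_msg, name)] = score_fn(response)
    return scores
```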

drAbreu commented 5 months ago

Another observation: the different system messages (simple, detailed, few shot) appear to sometimes give the exact same performance (see scores in ec86c2d). I think this should be investigated; at least to me, it is surprising. Performance seems almost binary; scores are close to 0 or around 1.8. The openhermes model has the exact same score in all three scenarios.

Hi Sebastian, I will be on vacation for the next 10 days and will likely not take a look at this until I am back.

I fully agree that this should be tested.

As a first approach to the problem, I took a sample of results and scores from OpenHermes to look for a possible bug in the scoring procedure. Some results are shown below. The scores match what I would expect, so in that sense the test is behaving correctly.

Whenever I am back at work, or find a spare moment, I will run another series of checks to confirm this. I will compare the answers for the three different prompts to see whether there is any systematic behavior that explains the similar scores.

EXPECTED: PINK1, FBXO7
RESPONSE: PINK1-/-, FBXO7-/-
SCORE: 0.8

EXPECTED: cell_type: None organism: None tissue: None cell_line: HeLa subcellular: nucleus, mitochondria
RESPONSE: cell_type: None organism: None tissue: None cell_line: HeLa, PINK1-/-, FBXO7-/- subcellular: mitochondria, nuclear DNA (DAPI), pUb
SCORE: 0.7499999999999999

EXPECTED: 3D structured illumination microscopy (3D-SIM)
RESPONSE: ANSWER: 3D-SIM, AO-induced mitophagy
SCORE: 0.3333333333333333

EXPECTED: Yes
RESPONSE: Yes
SCORE: 1.0

EXPECTED: 0.0001
RESPONSE: *p(****)<0.0001
SCORE: 0.6666666666666666

EXPECTED: DAPI
RESPONSE: AO, DAPI, HSP60
SCORE: 0.4

EXPECTED: Mean Acidic:Neutral mtKeima per-cell ratios, HeLa, mitochondria
RESPONSE: Mean Acidic:Neutral mtKeima per-cell ratios
SCORE: 0.823529411764706
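For readers unfamiliar with this kind of comparison, here is a simple token-set F1 sketch of how an expected answer and a model response can be scored against each other. This is not the scorer that produced the numbers above (which, for example, gives 0.4 for the DAPI case where this sketch gives 0.5); it only illustrates the general shape of such a comparison.

```python
# Illustrative token-set F1 between expected and returned answers.
# NOT the actual BioChatter benchmark scorer.
def token_f1(expected: str, response: str) -> float:
    exp = {t.strip().lower() for t in expected.split(",") if t.strip()}
    res = {t.strip().lower() for t in response.split(",") if t.strip()}
    tp = len(exp & res)  # tokens present in both sets
    if not exp or not res or tp == 0:
        return 0.0
    precision = tp / len(res)
    recall = tp / len(exp)
    return 2 * precision * recall / (precision + recall)


# Example: token_f1("DAPI", "AO, DAPI, HSP60") -> 0.5
```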

slobentanzer commented 5 months ago

I made the following changes:

drAbreu commented 5 months ago

Hi Sebastian,

I have pulled your latest changes to the code and am now back to keep working on this. Is there any part you want me to pay special attention to?

slobentanzer commented 5 months ago

@drAbreu welcome back; I guess most important at the moment is benchmark analysis and checking whether the tests actually are good indicators of performance. If you don't find any grave issues with that, I will merge after I have finished running the open source models.

drAbreu commented 5 months ago

Great :)

So, I have been playing around with Llama. It looks like there is a subtle behavior issue related to the user prompt.

The user prompt needs a suffix, something like "## ANSWER:", for the model to respond properly.

Here is an example of the user prompt as it currently is in the code:

image

And here is an example after adding "## ANSWER: " at the end of the user prompt:

image

So this might be the answer to the issues seen in the Llama models.

For the rest, I have been checking the outputs of the models and the examples and everything seems to be in order.

I think the course of action could be to add "## ANSWER:" as a suffix for the user prompt when the Llama models are used.
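A minimal sketch of that suggestion, assuming the benchmark assembles the user prompt from a caption and a question; the function name and the model-name matching below are hypothetical, not the actual benchmark code.

```python
# Hypothetical sketch: append an "## ANSWER:" suffix to the user prompt only
# for Llama-family models, which seem to need a completion-style cue.
ANSWER_SUFFIX = "\n## ANSWER:"


def build_user_prompt(model_name: str, caption: str, question: str) -> str:
    prompt = f"## CAPTION\n{caption}\n\n{question}"
    if "llama" in model_name.lower():
        # Without the suffix, Llama models tend to reply conversationally
        # instead of extracting the requested information.
        prompt += ANSWER_SUFFIX
    return prompt
```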

slobentanzer commented 4 months ago

I am thinking that by using this "## CAPTION", "## ANSWER FORMAT" etc. syntax, we may push the model towards a more completion-like behaviour. Two things come to mind for next developments: