epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Medprompt #20

Open AGBonnet opened 6 months ago

AGBonnet commented 6 months ago

Meditron x MedPrompt

Here's a first step to run MedPrompt on Meditron. This code is untested and your reviews are very welcome.

MedPrompt is composed of 3 steps:

  1. Dynamic few-shot: add to the prompt the 5 QA examples with the highest cosine similarity to the query (added)
    • Embed all training questions using OpenAI's 'text-embedding-ada-002'
    • For each test question, select the top K exemplars (K = args.shots) with the highest similarity to the question at hand
  2. CoT (already implemented)
  3. Ensemble with choice shuffling: from the 5-shot CoT prompt, generate multiple answers (each with CoT) and select the majority vote. Choice shuffling avoids the bias of choosing certain option positions more frequently (we already had self-consistency; only choice shuffling was added)
    • Add choice shuffling through Benchmark.load_data() and custom_preprocessing for MCQ benchmarks to debias final test questions
    • Check the shuffling random seed for reproducibility
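The dynamic few-shot step could be sketched as below. This is a minimal, untested sketch that assumes the embeddings are already available as numpy arrays; `top_k_exemplars` and the dummy vectors are hypothetical stand-ins (real 'text-embedding-ada-002' vectors are 1536-dimensional and come from the OpenAI API).

```python
import numpy as np

def top_k_exemplars(query_vec, train_vecs, k=5):
    """Return the indices of the k training questions most similar to the
    query, ranked by cosine similarity (highest first)."""
    # Normalize rows so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    T = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    sims = T @ q                     # cosine similarity of each row to the query
    return np.argsort(-sims)[:k]

# Dummy 8-dim embeddings standing in for the real API-produced vectors.
rng = np.random.default_rng(0)
train = rng.normal(size=(100, 8))
query = rng.normal(size=8)
print(top_k_exemplars(query, train, k=5))
```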

NOTE: in the original paper, the candidate KNN exemplars are QA pairs 'aced' by the model in 0-shot. This is not implemented for now, because it requires running inference on the dataset beforehand. However, we might be able to collect correct QA pairs from our past 0-shot evaluation runs.
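Collecting the 'aced' pairs from past runs could look like the following sketch; the record format here is a hypothetical stand-in for whatever our 0-shot evaluation logs actually contain.

```python
# Hypothetical records from a past 0-shot evaluation run: one dict per test
# question, with the gold label and the model's prediction.
runs = [
    {"question": "Q1", "gold": "B", "prediction": "B", "cot": "..."},
    {"question": "Q2", "gold": "A", "prediction": "C", "cot": "..."},
    {"question": "Q3", "gold": "D", "prediction": "D", "cot": "..."},
]

# Keep only the QA pairs the model answered correctly zero-shot, as the
# paper does for its KNN exemplar pool.
exemplar_pool = [r for r in runs if r["prediction"] == r["gold"]]
print([r["question"] for r in exemplar_pool])  # → ['Q1', 'Q3']
```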

For reference, here is the MedPrompt algorithm:

ALGORITHM
Input: Development data D, test question Q

A) PREPROCESSING:
for each question q in D:
    Get an embedding vector v_q for q.
    Generate a chain-of-thought C_q and an answer A_q with the LLM.
    if the answer A_q is correct then
        Store the triple (v_q, C_q, A_q).

B) INFERENCE:
Compute the embedding v_Q for the test question Q.
Select the K most similar examples {(v_i, C_i, A_i)}_{i=1}^K from the preprocessed training data using KNN,
    with cosine distance: dist(v_q, v_Q) = 1 − ⟨v_q, v_Q⟩ / (||v_q|| * ||v_Q||)
Format the K examples as context C for the LLM.
for k = 1 to K:
    Shuffle the answer choices of the test question.
    Generate a chain-of-thought C_k and an answer A_k with the LLM and context C.
Compute the majority vote of the generated answers: A_final = mode({A_k}_{k=1}^K)
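The ensemble step of the algorithm (choice shuffling, then majority vote over de-shuffled answers) could be sketched as below. This is untested illustration only: `fake_llm` is a hypothetical stand-in for the actual model call, and the choices are toy data.

```python
import random
from collections import Counter

def shuffle_choices(choices, seed):
    """Shuffle MCQ options. Returns the shuffled letter->text mapping and a
    map from each shuffled letter back to the original letter."""
    letters = sorted(choices)
    order = letters[:]
    random.Random(seed).shuffle(order)           # fixed seed for reproducibility
    shuffled = {letters[i]: choices[order[i]] for i in range(len(letters))}
    back = {letters[i]: order[i] for i in range(len(letters))}
    return shuffled, back

# Stand-in for the LLM: always picks the option whose text is 'paris'.
def fake_llm(shuffled):
    return next(l for l, text in shuffled.items() if text == "paris")

choices = {"A": "london", "B": "paris", "C": "rome", "D": "oslo"}
votes = []
for k in range(5):                               # K ensemble members
    shuffled, back = shuffle_choices(choices, seed=k)
    votes.append(back[fake_llm(shuffled)])       # de-shuffle before voting
print(Counter(votes).most_common(1)[0][0])       # majority vote → 'B'
```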
AGBonnet commented 6 months ago

Added choice de-shuffling before evaluation, so that the self-consistency majority answer can be selected.
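For the de-shuffling itself, it should be enough to record the permutation applied at load time and invert it before the majority vote; a minimal sketch (the permutation shown is hypothetical):

```python
# Permutation recorded when the question was shuffled: original -> shuffled letter.
perm = {"A": "C", "B": "A", "C": "D", "D": "B"}
inverse = {shuffled: orig for orig, shuffled in perm.items()}

# The model answered 'A' on the shuffled question; map back before voting.
model_answer = "A"
print(inverse[model_answer])  # → 'B'
```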

Remaining problems:

(1) Get access to CoT references for the whole training set. ThoughtSource has done some work on this and might be worth looking into. Their paper says they provide CoTs for all datasets except PubMedQA and MedQA, yet those do appear in their GitHub repo: "We created these reference CoTs by converting rationales provided by original datasets into reasoning chains."

(2) Self-generate explanations (as originally done in the MedPrompt paper)