epfLLM / meditron

Meditron is a suite of open-source medical Large Language Models (LLMs).
https://huggingface.co/epfl-llm
Apache License 2.0

Meditron-7b doesn't behave as expected #13

Open bitmman opened 7 months ago

bitmman commented 7 months ago

I've been experimenting with Meditron-7b for answering medical queries, but its performance seems worse than expected compared to other LLMs.

I loaded the model and tokenizer and then used the standard HF pipeline:

pipeline = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    temperature=0.01,
    do_sample=True,
    top_k=3,
    top_p=0.01,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=200,
)
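
For context, the pipeline above assumes model and tokenizer are already loaded; that step isn't shown in the issue, but a minimal sketch, assuming the public epfl-llm/meditron-7b checkpoint on Hugging Face, would be:

import torch
import transformers

# Hypothetical loading step (not shown in the original report); half precision
# and device_map="auto" are just common defaults for a 7B model on one GPU.
model_id = "epfl-llm/meditron-7b"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)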

Then I used the LangChain HuggingFacePipeline wrapper:

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipeline)

For a simple greeting with llm(prompt="Hi, how are you?"), the model repetitively echoed the prompt:

'\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi,'

When asked about lung cancer risk factors with llm(prompt="What are the risk factors for lung cancer?"), it provided a list of related questions instead of a direct answer:

  • What are the symptoms of lung cancer?
  • What causes lung cancer?
  • What are the stages of lung cancer?
  • When to seek urgent medical care?
  • How to diagnose lung cancer?
  • How to treat lung cancer?
  • How to prevent lung cancer?
  • What to expect (Outlook/Prognosis)?

Further, when I used a formatted prompt based on an example from the GitHub repository, the response repeated the prompt-format instructions verbatim without addressing the medical query.

def format_prompt(prompt):
    system_msg = "You are a helpful, respectful and honest assistant." + \
        "Always answer as helpfully as possible, while being safe." + \
        "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content." + \
        "Please ensure that your responses are socially unbiased and positive in nature.\n\n" + \
        "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct." + \
        "If you don't know the answer to a question, please don't share false information."
    return f"<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n"

example = {
    "prompt": """Four weeks after starting hydrochlorothiazide, a 49-year-old man with hypertension comes to the physician because of muscle cramps and weakness. His home medications also include amlodipine. His blood pressure today is 176/87 mm Hg. Physical examination shows no abnormalities. The precordial leads of a 12-lead ECG are shown. The addition of which of the following is most likely to have prevented this patient's condition?\n\nOptions:\nA. Torsemide \nB. Nifedipine \nC. Eplerenone \nD. Hydralazine""",
    "gold": "C",
    "steps": [
        "The patient has started hydrochlorothiazide.",
        "He now presents with muscle cramps and weakness and an ECG that supports the diagnosis of hypokalemia.",
        "(A) Torsemide is a loop diuretic and would likely aggravate the hypokalemia.",
        "(B) Nifedipine is a calcium antagonist and would not alleviate the hypokalemia.",
        "(C) Eplerenone is a potassium-sparing diuretic and would likely decrease the chance of hypokalemia.",
        "(D) Hydralazine is a potent vasodilator and would not decrease the risk of hypokalemia.",
    ],
}

prompt = format_prompt(example["prompt"])
res = llm(prompt=prompt)
print(res)

This returned:

You are a helpful, respectful and honest assistant.Always answer as helpfully as possible, while being safe.Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.If you don't know the answer to a question, please don't share false information.<|im_end|> <|im_start|> user A 65-year-old man with a history of hypertension and hyperlipidemia presents with a 2-week history of progressive dyspnea on exertion. He has a history of smoking 1 pack of cigarettes per day for 30 years. He has no history of diabetes mellitus, coronary artery disease, or peripheral vascular disease. His blood pressure is 150/90 mm Hg, and his pulse is 80 beats per minute. Physical examination reveals a grade 3/6 systolic murmur at the apex. The precordial leads of a 12-lead ECG are shown. The addition of which of the following is most likely to have prevented this patient's condition?

Options: A. Amlodipine B. Lisinopril C. Metoprolol D. Nifedipine<|im_end|> <|im_start|> assistant You are a helpful, respectful and honest assistant.Always answer as helpfully as possible, while being safe.Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.Please ensure that your responses are socially unbiased and positive in nature.

Is this behavior typical for Meditron-7b, or might it be an issue with my prompting technique? Additionally, would Meditron-70b potentially yield better results?

eric11eca commented 7 months ago

Hi there, thanks for reaching out! Please see our answer to this related issue #9

In short, the <|im_start|> and <|im_end|> format is used only by our finetuned models (not released yet). For the base model, you can apply in-context learning by providing the model with several demonstrations in the prompt, or follow the one-shot example in our deployment doc if you are doing chat-based prompting.
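
For illustration, in-context prompting of the base model might look roughly like the sketch below; the demonstration text and template here are made up for illustration and are not the official ones from the deployment doc:

# Sketch of in-context learning with the base model: prepend a worked
# demonstration, then ask the new question. The template and answers below
# are illustrative only.
demonstration = (
    "Question: What are the risk factors for lung cancer?\n"
    "Answer: Smoking, exposure to radon gas, exposure to asbestos, "
    "family history of lung cancer, and air pollution.\n\n"
)
question = "What are the risk factors for chronic kidney disease?"
prompt = demonstration + f"Question: {question}\nAnswer:"
print(llm(prompt=prompt))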

In addition, the 70B model yields much better results. In our paper, you can see the performance comparisons we reported for in-context learning.

[image: in-context learning performance comparison from the paper]

Hope this helps answer your question.

bitmman commented 7 months ago

Hi, thanks for your prompt answer. I experimented with providing a one-shot example. It sometimes works fine, but sometimes it doesn't. Here is my example prompt:

You are an expert in identifying risk factors for diseases. 
Answer the question in a concise way. I'll show you an example, and you respond in a similar way.
### USER:
What are the risk factors for lung cancer?
### Assistant:
Smoking
Exposure to Radon Gas
Exposure to Asbestos and Other Carcinogens
Family History of Lung Cancer
Personal History of Lung Disease
Air Pollution
Radiation Therapy to the Chest
Age

### USER:
What are the risk factors for CKD?
### Assistant:

It returns

### USER:
What are the risk factors for CKD?
### Assistant:
Smoking
Diabetes
High Blood Pressure
Family History of Kidney Disease
Personal History of Kidney Disease
Obesity
Age
Race
Sex
Socioeconomic Status
Exposure to Heavy Metals
Exposure to Pesticides
Exposure to Herbicides
Exposure to Chemicals
Exposure to Radiation
Exposure to Heavy Metals
Exposure to Pesticides
Exposure to Herbicides
Exposure to Chemicals
Exposure to Radiation
Age
Race
Sex
Socioeconomic Status
Exposure to Heavy Metals

It seems okay, but for the next question, query = "What are the risk factors for breast cancer?", using the same prompt, I got:

### USER:
What are the risk factors for prostate cancer?
### Assistant:
Age
Family History of Prostate Cancer
Race
Personal History of Prostate Disease
Exposure to Radiation
Exposure to Chemicals
Obesity
Smoking
Alcohol
Diet
Family History of Other Cancers
Family History of Breast Cancer
Family History of Colorectal Cancer
Family History of Lung Cancer
Family History of Ovarian Cancer
Family History of Pancreatic Cancer
Family History of Prostate Cancer
Family History of Stomach Cancer
Family History of Thyroid Cancer
Family History of Uterine Cancer
Family History of Uterine Cancer
Family History of Uterine Cancer

It keeps repeating itself. Any suggestions to improve the performance? I appreciate your help.
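
One generation-side tweak that sometimes helps with this kind of looping, independent of Meditron itself, is to penalize repetition via the standard transformers generation options; a minimal sketch using the pipeline defined earlier (the values are illustrative, not tuned recommendations):

# Sketch: discourage looping list items by penalizing repeats at decode time.
# repetition_penalty and no_repeat_ngram_size are standard generate() options
# forwarded by the text-generation pipeline. `prompt` is the one-shot prompt
# string shown above.
out = pipeline(
    prompt,
    repetition_penalty=1.2,
    no_repeat_ngram_size=4,
)
print(out[0]["generated_text"])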

bitmman commented 7 months ago

Additionally, the model often spits back what I input. Do you have any idea how to avoid this kind of issue? Thanks.
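
Along the same lines, one way to keep the model from spilling into further "### USER:" turns is to stop generation as soon as that marker appears, using the standard transformers StoppingCriteria API; this is a sketch, not something from the Meditron docs:

from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnMarker(StoppingCriteria):
    """Stop once the decoded continuation contains a given marker string."""
    def __init__(self, tokenizer, marker, prompt_len):
        self.tokenizer = tokenizer
        self.marker = marker
        self.prompt_len = prompt_len  # prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        continuation = self.tokenizer.decode(
            input_ids[0][self.prompt_len:], skip_special_tokens=True
        )
        return self.marker in continuation

# `prompt` is the one-shot prompt string shown above.
prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
stop = StoppingCriteriaList([StopOnMarker(tokenizer, "### USER:", prompt_len)])
out = pipeline(prompt, stopping_criteria=stop)
print(out[0]["generated_text"].split("### USER:")[0].strip())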

charlestang06 commented 3 months ago

I am also encountering this issue. Sometimes the model also returns the question itself and refuses to answer it, even with the one-shot format above.