This is the repo for AlpaCare, a family of LLMs tuned on medical instructions. The repo contains:
AlpaCare comprises four models (7B/13B, LLaMA [1]/LLaMA-2 [2]) tuned on MedInstruct-52k, a 52k medical instruction-following dataset, following Alpaca [3] and Self-Instruct [4]. You can find our model weights at:
Version | Link |
---|---|
AlpaCare-LLaMA_7B | https://huggingface.co/xz97/AlpaCare-llama1-7b |
AlpaCare-LLaMA2_7B | https://huggingface.co/xz97/AlpaCare-llama2-7b |
AlpaCare-LLaMA_13B | https://huggingface.co/xz97/AlpaCare-llama-13b |
AlpaCare-LLaMA2_13B | https://huggingface.co/xz97/AlpaCare-llama2-13b |
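As a minimal sketch of how these weights might be loaded, the snippet below maps short labels to the repository IDs in the table above and loads one checkpoint with the Hugging Face `transformers` Auto classes. The short labels and the `load_alpacare` helper are hypothetical names introduced for illustration, and the sketch assumes the checkpoints are stored in the standard Hugging Face format.

```python
# Repo IDs copied from the table above; the short keys are hypothetical labels.
MODEL_REPOS = {
    "llama1-7b": "xz97/AlpaCare-llama1-7b",
    "llama2-7b": "xz97/AlpaCare-llama2-7b",
    "llama1-13b": "xz97/AlpaCare-llama-13b",
    "llama2-13b": "xz97/AlpaCare-llama2-13b",
}


def load_alpacare(name: str = "llama2-7b"):
    """Download and return (tokenizer, model) for one AlpaCare checkpoint.

    Assumes the checkpoint loads through the generic Auto classes.
    """
    # Imported lazily so the mapping above is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = MODEL_REPOS[name]
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model
```

Calling `load_alpacare("llama2-13b")` downloads the weights on first use, so it needs network access and enough memory for the chosen model size.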
To set up a conda environment for data generation/model training, please do:
pip install -r requirements.txt
[1]: LLaMA: Open and Efficient Foundation Language Models. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, Guillaume Lample. https://arxiv.org/abs/2302.13971v1
[2]: Llama 2: Open Foundation and Fine-Tuned Chat Models. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. https://arxiv.org/abs/2307.09288
[3]: Stanford Alpaca: An Instruction-following LLaMA Model. Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto. https://crfm.stanford.edu/2023/03/13/alpaca.html
[4]: Self-Instruct: Aligning Language Model with Self Generated Instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. https://arxiv.org/abs/2212.10560
MedInstruct contains:
MedInstruct datasets:
`gpt-4`, `gpt-3.5-turbo`, `text-davinci-003`, and `claude-2`.
All files are lists of dictionaries in JSON/JSONL format. Each dictionary contains the following fields:
- `instruction`: `str`, the medical task the model should perform. Each instruction in MedInstruct-52K and MedInstruct-test is unique. The instructions in MedInstruct-52K are generated by OpenAI `gpt-4`, while those in MedInstruct-test are clinician-crafted.
- `input`: `str`, optional medical context or example for the task, which further explains the instruction and increases task diversity.
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.
- `difficulty`: `str`, the medical knowledge difficulty level on a scale from 1 to 5, evaluated by clinicians following the scoring system. This only appears in MedInstruct-test.
Instructions:
`gpt-4` with/without filtering.
We follow the data generation pipeline from Self-Instruct and Alpaca, with several modifications:
Instructions:
Responses:
Here's an illustration of an instruction-response pair:
{
  "instruction": "Rewrite this instruction on how to use an inhaler to be more patient-friendly and less technical.",
  "input": "Exhale deeply, engage mouthpiece of inhaler, depress canister while simultaneously inhaling long and deep, hold the breath for ten seconds, then gently exhale",
  "output": "Here's a more patient-friendly version of the instruction on how to use an inhaler:\n1. Take a deep breath and completely breathe out.\n2. Put the mouthpiece of the inhaler into your mouth and seal your lips around it properly.\n3. Press down the canister while breathing in slowly and deeply at the same time.\n4. Hold your breath for ten seconds.\n5. Finally, breathe out gently."
}
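Records in this shape can be read and sanity-checked with the standard library alone. The sketch below is illustrative rather than part of the repo: `parse_records` and `validate` are hypothetical helpers, and treating `input` and `difficulty` as fields that may be absent is an assumption based on the field descriptions.

```python
import json

REQUIRED_FIELDS = ("instruction", "output")
# All documented fields are strings; "input" and "difficulty" may be absent (assumption).
STRING_FIELDS = ("instruction", "input", "output", "difficulty")


def parse_records(text: str) -> list:
    """Parse either a JSON array or JSONL text into a list of dictionaries."""
    stripped = text.strip()
    if stripped.startswith("["):
        return json.loads(stripped)
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


def validate(record: dict) -> list:
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    for name in STRING_FIELDS:
        if name in record and not isinstance(record[name], str):
            problems.append(f"{name} is not a string")
    return problems


if __name__ == "__main__":
    sample = '{"instruction": "Define hypertension.", "input": "", "output": "..."}'
    for rec in parse_records(sample):
        print(validate(rec))
```

Returning a list of problems instead of raising on the first one makes it easier to scan a whole file and report every malformed record at once.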
To generate the data:
Please check the task generation script:
sh task_output_generation/task_generation.sh
and the output generation script:
sh task_output_generation/output_generation.sh
In the instruction data analysis, we demonstrate the diversity of MedInstruct-52K in terms of:
(a) Instruction Language: The inner circle displays the 20 most frequent root verbs, while the outer circle shows the top 4 noun objects associated with each verb in the generated instructions. Despite this wide range, only 22% of the instructions are covered, as the others do not adhere to the verb-noun format.
(b) View: The 20 most frequent views from various medical personnel constitute 55% of MedInstruct-52K.
(c) Task Types: The top 20 task types covered in MedInstruct-52K. Existing medical instruction-tuned models focus only on question-answering and doctor-patient conversation tasks.
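The verb-noun breakdown in (a) can be sketched as a simple counting step. The example below assumes the (root verb, noun object) pairs have already been extracted by some parser, which is not shown here; `verb_noun_summary` and both returned shares are hypothetical names, and the exact coverage definition behind the 22% figure is not stated in this README.

```python
from collections import Counter, defaultdict


def verb_noun_summary(pairs, total_instructions, top_verbs=20, top_nouns=4):
    """Summarise (root_verb, noun_object) pairs for a two-ring plot.

    `pairs` holds one tuple per instruction that fits the verb-noun format;
    `total_instructions` counts every instruction, parsed or not.
    """
    verb_counts = Counter(verb for verb, _ in pairs)
    nouns_by_verb = defaultdict(Counter)
    for verb, noun in pairs:
        nouns_by_verb[verb][noun] += 1

    summary = {}
    plotted = 0
    for verb, _ in verb_counts.most_common(top_verbs):
        top = nouns_by_verb[verb].most_common(top_nouns)
        summary[verb] = top  # inner ring: verb, outer ring: its top nouns
        plotted += sum(count for _, count in top)

    if total_instructions == 0:
        return summary, 0.0, 0.0
    parsed_share = len(pairs) / total_instructions   # fits the verb-noun format
    plotted_share = plotted / total_instructions     # actually shown in the plot
    return summary, parsed_share, plotted_share
```

Reporting both shares keeps the two possible readings of "covered" apart: instructions that parse into a verb-noun pair at all, and instructions that land in the displayed top verbs and nouns.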
We follow the Alpaca prompt to fine-tune LLaMA-series models and use the standard Hugging Face training code.
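For reference, the prompt template published with Stanford Alpaca has two variants, one for records with an `input` and one without. The sketch below reproduces that template; `build_prompt` is a hypothetical helper, and the training scripts in this repo may format prompts differently.

```python
# Template text as published with Stanford Alpaca.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(record: dict) -> str:
    """Format one instruction record, picking the variant by whether input is non-empty."""
    context = (record.get("input") or "").strip()
    if context:
        return PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=context)
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])
```

During fine-tuning the `output` field would be appended after `### Response:` as the target text.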
For instruction fine-tuning of LLaMA/LLaMA-2 7B:
sh training/train_7b.sh
For instruction fine-tuning of LLaMA/LLaMA-2 13B:
sh training/train_13b.sh
We compare AlpaCare with several instruction-tuned LLMs based on the LLaMA models, across different scales and with various tuning datasets. Free-form instruction evaluations are conducted on iCliniq, a patient-doctor conversation set, and on a medical instruction test set crafted by our clinicians (MedInstruct-test). To further evaluate generalization ability, we use a general-domain test set, AlpacaFarm.
AlpaCare shows strong medical capacity and generalization ability compared to baselines at both the 7B and 13B scales. We follow AlpacaFarm and use gpt-3.5-turbo as the judge for the comparison. We compare each instruction-tuned model with 4 distinct reference models: text-davinci-003, gpt-3.5-turbo, gpt-4, and claude-2.
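One way to turn per-example judge verdicts into a single score per reference model is a win rate. The sketch below is only an illustration: the verdict labels, the half-credit for ties, and both function names are assumptions, not the scoring code used for the reported results.

```python
from collections import Counter

VALID_VERDICTS = {"win", "tie", "lose"}


def win_rate(verdicts):
    """Score a tuned model against one reference model.

    Each verdict is "win", "tie", or "lose" from the tuned model's point of
    view. Ties count as half a win (an assumption made for this sketch).
    """
    counts = Counter(verdicts)
    unknown = set(counts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unexpected verdicts: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to score")
    return (counts["win"] + 0.5 * counts["tie"]) / total


def score_against_references(verdicts_by_reference):
    """Compute one win rate per reference model name."""
    return {ref: win_rate(v) for ref, v in verdicts_by_reference.items()}
```

A score of 0.5 under this convention means the tuned model and the reference model are judged equally good on average.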
We provide all reference model outputs and instruction-tuned model outputs.
If you find this repo useful, please cite the paper:
@misc{zhang2023alpacareinstructiontuned,
title={AlpaCare: Instruction-tuned Large Language Models for Medical Application},
author={Xinlu Zhang and Chenxin Tian and Xianjun Yang and Lichang Chen and Zekun Li and Linda Ruth Petzold},
year={2023},
eprint={2310.14558},
archivePrefix={arXiv},
primaryClass={cs.CL}
}