📃 Paper • 🤗 Huatuo-Lite • 🤗 huatuo_encyclopedia_qa • 🤗 knowledge_graph_qa • 🤗 huatuo_consultation_qa
The Huatuo-26M dataset is collected and integrated from multiple sources, including:
Each question-answer pair in the dataset contains the following fields:
The following is the Huatuo-26M test set used in the paper. It was built by randomly sampling data from multiple sources.
The Huatuo-26M dataset can be used for a variety of AI research and applications in the medical field, such as:
To start using the Huatuo-26M dataset, you can follow the steps below:
import datasets

# Part 1: knowledge graph QA
knowledge_graph_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_knowledge_graph_qa')
# Part 2: encyclopedia QA
encyclopedia_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_encyclopedia_qa')
# Part 3: consultation QA (URLs only)
consultation_dataset = datasets.load_dataset('FreedomIntelligence/huatuo_consultation_qa')
# Test set (6k samples)
huatuo_testdatasets = datasets.load_dataset('FreedomIntelligence/huatuo26M-testdatasets')
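After loading a part, it can help to check which splits and columns it actually contains before writing any processing code. The snippet below is a minimal sketch, not part of the official tooling: `summarize_splits` and `load_and_summarize` are hypothetical helper names, and the code only assumes that each split supports `len()` and exposes a `column_names` attribute, as `datasets.Dataset` objects do.

```python
from typing import Any, Dict, Mapping


def summarize_splits(dataset_dict: Mapping[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Return the row count and column names for every split.

    Works on any mapping of split name -> split object, such as the
    DatasetDict returned by datasets.load_dataset.
    """
    summary: Dict[str, Dict[str, Any]] = {}
    for split_name, split in dataset_dict.items():
        summary[split_name] = {
            "num_rows": len(split),
            # Fall back to an empty list if the object has no column_names.
            "columns": list(getattr(split, "column_names", [])),
        }
    return summary


def load_and_summarize(repo_id: str) -> Dict[str, Dict[str, Any]]:
    """Download one dataset from the Hub and summarize it (needs network access)."""
    import datasets  # imported lazily so summarize_splits works without it

    return summarize_splits(datasets.load_dataset(repo_id))
```

For example, `load_and_summarize('FreedomIntelligence/huatuo26M-testdatasets')` would list the splits and fields of the test set without assuming their names in advance.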
Retrieval Evaluation:
Answer Generation Evaluation:
Zero-shot transfer to other QA datasets:
As external knowledge for RAG:
As pre-training data for language model (LM):
As fine-tuning data for Medical LLM:
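
For the retrieval evaluation use case listed above, one common way to score a retriever on question-answer pairs is Recall@k: the fraction of questions whose gold answer appears among the top k retrieved candidates. The sketch below is an illustration under assumptions, not the evaluation code from the paper. The function name `recall_at_k`, the ID-based input format, and the choice of metric are all hypothetical.

```python
from typing import Dict, Sequence


def recall_at_k(
    ranked_ids: Dict[str, Sequence[str]],
    gold_id: Dict[str, str],
    k: int,
) -> float:
    """Fraction of questions whose gold answer ID is in the top-k ranking.

    ranked_ids maps a question ID to candidate answer IDs, best first.
    gold_id maps a question ID to its single correct answer ID.
    Questions missing from ranked_ids count as misses.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if not gold_id:
        return 0.0
    hits = sum(
        1
        for question, gold in gold_id.items()
        if gold in list(ranked_ids.get(question, []))[:k]
    )
    return hits / len(gold_id)
```

With rankings produced by any retriever, calling `recall_at_k(rankings, gold, k=5)` gives a single number that can be compared across models.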
The Huatuo-26M dataset is licensed under Apache 2.0. Please make sure you have read and agreed to the license terms before using it.
If you have any questions or need help, please feel free to contact us by email (xidongw@163.com) or in the Issues section.
@misc{li2023huatuo26m,
title={Huatuo-26M, a Large-scale Chinese Medical QA Dataset},
author={Jianquan Li and Xidong Wang and Xiangbo Wu and Zhiyi Zhang and Xiaolong Xu and Jie Fu and Prayag Tiwari and Xiang Wan and Benyou Wang},
year={2023},
eprint={2305.01526},
archivePrefix={arXiv},
primaryClass={cs.CL}
}