Data-Centric Foundation Models in Computational Healthcare
:fire::fire::fire: A survey on data-centric foundation models in computational healthcare
Project Page | Paper [arXiv]
Last updated: 2024/10/08
:pencil: If you find this repo helpful, please cite our survey. Thanks!
@article{zhang2024data,
title={Data-Centric Foundation Models in Computational Healthcare: A Survey},
author={Zhang, Yunkun and Gao, Jin and Tan, Zheling and Zhou, Lingfeng and Ding, Kexin and Zhou, Mu and Zhang, Shaoting and Wang, Dequan},
journal={arXiv preprint arXiv:2401.02458},
year={2024}
}
In this repository, we provide an up-to-date list of healthcare-related foundation models and datasets, which are also mentioned in our survey paper.
:book: Contents
Healthcare and Medical Foundation Models
A star (*) after the pre-training data indicates that the authors constructed the data from more than three sources.
Language Models
Vision Models
Vision-Language Models
Protein and Molecule Models
Other Models
Datasets for Foundation Models
Text
| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| MedBench (arXiv) | A Chinese medical LLM benchmark with 300,901 Chinese questions covering 43 clinical specialties, combined with an automatic evaluation system | Official site |
| MMedBench (arXiv) | A multilingual medical QA benchmark, where questions are categorized into 21 topics | Github |
| MMedC (arXiv) | A multilingual medical corpus containing over 25.5B tokens | Github |
| BiMed1.3M (arXiv) | An English and Arabic bilingual dataset of 1.3M samples of medical QA and chat | Github |
| GAP-Replay (arXiv) | 48.1B tokens from 4 medical corpora including guidelines, abstracts, papers, and replay | Github |
| Huatuo-26M (arXiv) | 26M Chinese medical QA pairs | Github |
| Medical Meadow (arXiv) | 16M medical QA pairs collected from 9 sources | Github |
| MultiMedQA (Nature) | 6 existing and 1 online-collected medical QA datasets | Nature |
| BigBio (Nature) | 126+ biomedical NLP datasets covering 13 task categories and 10+ languages | Github |
| MedMCQA (MLR) | 194K multiple-choice questions covering 2.4K healthcare topics | Official site |
| MedQA-USMLE (MDPI) | 61,097 multiple-choice questions based on USMLE in three languages | Github |
| CBLUE (arXiv) | A Chinese biomedical language understanding evaluation benchmark with 18 datasets | Official site |
| BLURB (arXiv) | 13 biomedical NLP datasets in 6 tasks | Official site |
| PubMedQA (arXiv) | 1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated biomedical QA instances | Official site |
| BLUE (arXiv) | 5 language tasks with 10 biomedical and clinical text datasets | Github |
| webMedQA (BMC) | 63,284 real-world Chinese medical questions with over 300K answers | Github |
| MedMentions (arXiv) | 4,392 papers annotated by experts with mentions of UMLS entities | Github |
| MIMIC-III (Nature) | Critical care data for over 40,000 patients | Official site |
| ClinicalTrials.gov | An online database of clinical research studies, including clinical trials and observational studies | Official site |
Imaging
| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| Mass-100K (arXiv) | 100M tissue patches from 100,426 diagnostic H&E WSIs across 20 major tissue types | - |
| RETFound (Nature) | Unannotated retinal images, containing 904,170 CFPs and 736,442 OCT scans | Nature |
| AbdomenAtlas-8K (arXiv) | 8,448 CT volumes with per-voxel annotated eight abdominal organs | Github |
| Med-MNIST v2 (Nature) | 12 2D and 6 3D datasets for biomedical image classification | Official site |
| EchoNet-Dynamic (Nature) | 10,030 expert-annotated echocardiogram videos | Official site |
| CheXpert (arXiv) | 224,316 chest radiographs of 65,240 patients | Official site |
| Kather Colon Dataset (PMC) | 100K histological images of human colorectal cancer and healthy tissue | Zenodo |
| DeepLesion (PMC) | 32K CT scans with annotations and semantic labels from radiological reports | NIH |
| ChestXray-NIHCC (arXiv) | 100K radiographs with labels from more than 30,000 patients | NIH |
| ISIC | An archive containing 23K skin lesion images with labels | Official site |
Genomics
| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| 1000 Genomes Project (Nature) | A comprehensive catalog of human genetic variations | Official site |
| ENCODE (Nature) | A platform of genomics data and encyclopedia with integrative-level and ground-level annotations | NIH |
| dbSNP (NIH) | A collection of human single nucleotide variations, microsatellites, and small-scale insertions and deletions | NIH |
Drug
| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| DrugChat (arXiv) | 143,517 question-answer pairs covering 10,834 drug compounds, collected from PubChem and ChEMBL | Github |
| PubChem (NIH) | A collection of 900+ sources of chemical information data | NIH |
| DrugBank (NIH) | A web-enabled structured database of molecular information about drugs | Official site |
| ChEMBL (NIH) | 20M bioactivity measurements for 2.4M distinct compounds and 15K protein targets | Official site |
Multi-Modal
| Dataset (Paper) | Description | Link |
| --- | --- | --- |
| RadGenome-Chest CT (arXiv) | A dataset of 3D chest CT, including 197 organ-level segmentation masks, 665K multi-granularity grounded reports, and 1.3M grounded VQA pairs | - |
| OmniMedVQA (arXiv) | 131,813 question-answering items with 120,530 images from 12 modalities and 26 human anatomical regions, collected from 75 medical datasets | - |
| SAT-DS (arXiv) | 11,462 scans with 142,254 segmentation annotations spanning 8 human body regions from 31 medical image segmentation datasets, together with domain knowledge from e-Anatomy and UMLS | Github |
| PathChatInstruct (arXiv) | 257,004 instructions of pathology-specific queries with image and text | - |
| Chi-Med-VL (arXiv) | 580,014 image-text pairs and 469,441 question-answer pairs for general healthcare in Chinese | Github |
| MedMD (arXiv) | 15.5M 2D scans and 180K 3D radiology scans with textual descriptions | Github |
| OpenPath (Nature) | 208,414 pathology images paired with natural language descriptions | Huggingface |
| Quilt-1M (arXiv) | 1M image-text pairs for histopathology | Github |
| Med-MMHL (arXiv) | Human- and LLM-generated misinformation detection dataset | Github |
| Mol-Instructions (arXiv) | 148K molecule-oriented, 505K protein-oriented, and biomolecular text instructions | Huggingface |
| PathInstruct (arXiv) | 180K samples of LLM-generated instruction-following data | Github |
| PMC-VQA (arXiv) | 227K VQA pairs of 149K images of various modalities or diseases | Github |
| PMC-OA (arXiv) | 1.6M fine-grained biomedical image-text pairs | Github |
| PathCap (arXiv) | 142K pathology image-caption pairs from various sources | Github |
| SwissProtCLAP (arXiv) | 441K text-protein sequence pairs | Github |
| MIMIC-IV (Nature) | Clinical information for hospital stays of over 60,000 patients | Official site |
| MIMIC-CXR (Nature) | 227,835 chest imaging studies with free-text reports for 65,379 patients | PhysioNet |
| TCGA | A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types | Official site |