Data-Centric Foundation Models in Computational Healthcare

:fire::fire::fire: A survey on data-centric foundation models in computational healthcare

Last updated: 2024/10/08

:pencil: If you find this repo helpful, please cite our survey. Thanks!

@article{zhang2024data,
  title={Data-Centric Foundation Models in Computational Healthcare: A Survey},
  author={Zhang, Yunkun and Gao, Jin and Tan, Zheling and Zhou, Lingfeng and Ding, Kexin and Zhou, Mu and Zhang, Shaoting and Wang, Dequan},
  journal={arXiv preprint arXiv:2401.02458},
  year={2024}
}

In this repository, we provide an up-to-date list of healthcare-related foundation models and datasets, which are also mentioned in our survey paper.

:book: Contents

Healthcare and medical foundation models
Datasets for foundation model
- Text
- Imaging
- Genomics
- Drug
- Multi-modal

Healthcare and Medical Foundation Models

A star (*) after the pre-training data indicates that the authors constructed the dataset from more than three sources.
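For readers who want to process the tables below programmatically, here is a minimal sketch of how the star convention could be read from a "Pre-Training Data" cell. It assumes that sources are separated by `+` and that the star, when present, is the last character of the cell. The helper name `parse_pretraining_data` is hypothetical and not part of this repository.

```python
def parse_pretraining_data(cell: str) -> tuple[list[str], bool]:
    """Split a 'Pre-Training Data' cell into named sources and a star flag.

    Assumption: sources are joined by '+', and a trailing '*' marks data
    that the authors constructed from more than three sources.
    """
    cell = cell.strip()
    starred = cell.endswith("*")
    if starred:
        cell = cell[:-1].strip()
    # A cell that is only '*' has no named source, so the list is empty.
    sources = [part.strip() for part in cell.split("+") if part.strip()]
    return sources, starred


# A few cells copied from the tables in this README.
for model, cell in [("MMedLM 2", "MMedC*"), ("Me LLaMA", "*"), ("BioBERT", "PubMed + PMC")]:
    sources, starred = parse_pretraining_data(cell)
    print(model, sources, starred)
```

A cell such as `PubMed + PMC` yields two named sources and no star, while a bare `*` yields no named source and a star.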

Language Models

Model	Subfield	Paper	Code	Base	Pre-Training Data
MMedLM 2	Medicine	Towards Building Multilingual Language Model for Medicine	Github	InternLM 2	MMedC*
BiMediX	Medicine	BiMediX: Bilingual Medical Mixture of Experts LLM	Github	Mixtral	BiMed1.3M*
Me LLaMA	Medicine	Me LLaMA: Foundation Large Language Models for Medical Applications	Github	LLaMA 2	*
BioMistral	Biomedicine	BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains	-	Mistral	PMC
PULSE	Medicine	-	Github	InternLM	*
Meditron	Medicine	Meditron-70B: Scaling Medical Pretraining for Large Language Models	Github	LLaMA 2	GAP-Replay*
Taiyi	Biomedicine	Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks	Github	Qwen	BigBio + CBLUE
BioMedGPT	Biomedicine	BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine	Github	LLaMA 2	S2ORC
Clinical LLaMA-LoRA	Clinic	Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain	-	LLaMA	MIMIC-IV
Med-PaLM 2	Clinic	Towards Expert-Level Medical Question Answering with Large Language Models	Google	PaLM 2	MedQA
PMC-LLaMA	Medicine	PMC-LLaMA: Towards Building Open-source Language Models for Medicine	Github	LLaMA	MedC
MedAlpaca	Medicine	MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data	Github	LLaMA	Medical Meadow
BenTsao (HuaTuo)	Biomedicine	HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge	Github	LLaMA	CMeKG
ChatDoctor	Medicine	ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge	Github	LLaMA	HealthCareMagic*
Clinical-T5	Clinic	Clinical-T5: Large Language Models Built Using MIMIC Clinical Text	PhysioNet	T5	MIMIC-III + MIMIC-IV
Med-PaLM	Clinic	Large Language Models Encode Clinical Knowledge	Google	PaLM	MedQA
BioGPT	Biomedicine	BioGPT: Generative Pre-Trained Transformer for Biomedical Text Generation and Mining	Github	GPT-2	PubMed
BioLinkBERT	Biomedicine	LinkBERT: Pretraining Language Models with Document Links	Github	BERT	PubMed
PubMedBERT	Biomedicine	Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing	Microsoft	BERT	PubMed
BioBERT	Biomedicine	BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining	Github	BERT	PubMed + PMC
BlueBERT	Biomedicine	An Empirical Study of Multi-Task Learning on BERT for Biomedical Text Mining	Github	BERT	PubMed + MIMIC-III
Clinical BERT	Clinic	Publicly Available Clinical BERT Embeddings	Github	BERT	MIMIC-III
SciBERT	Biomedicine	SciBERT: A Pretrained Language Model for Scientific Text	Github	BERT	Semantic Scholar

Vision Models

Model	Subfield	Paper	Code	Base	Pre-Training Data
Prov-GigaPath	Pathology	A Whole-Slide Foundation Model for Digital Pathology from Real-World Data	Github	-	Prov-Path*
BEPH	Pathology	A Foundation Model for Generalizable Cancer Diagnosis and Survival Prediction from Histopathological Images	Github	BEiTv2	*
(No name)	Radiology	Foundation Model for Cancer Imaging Biomarkers	Github	SimCLR	*
VISION-MAE	Radiology	VISION-MAE: A Foundation Model for Medical Image Segmentation and Classification	-	MAE	*
RudolfV	Pathology	RudolfV: A Foundation Model by Pathologists for Pathologists	-	DINOv2	*
PathoDuet	Pathology	PathoDuet: Foundation Models for Pathological Slide Analysis of H&E and IHC Stains	Github	MoCo v3	TCGA + HyReCo + BCI
UNI	Pathology	A General-Purpose Self-Supervised Model for Computational Pathology	-	DINOv2	Mass-100K
REMEDIS	Radiology	Robust and Data-Efficient Generalization of Self-Supervised Machine Learning for Diagnostic Imaging	Github	SimCLR	MIMIC-IV + CheXpert
Virchow	Pathology	Virchow: A Million-Slide Digital Pathology Foundation Model	-	DINOv2	*
RETFound	Retinopathy	A Foundation Model for Generalizable Disease Detection from Retinal Images	Github	MAE	*
CTransPath	Pathology	Transformer-Based Unsupervised Contrastive Learning for Histopathological Image Classification	Github	-	TCGA + PAIP
HIPT	Pathology	Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning	Github	DINO	TCGA

Vision-Language Models

Model	Subfield	Paper	Code	Base	Pre-Training Data
Uni-Med	Medicine	Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE	-	CLIP + LLaMA 2	*
RadFound	Radiology	Expert-Level Vision-Language Foundation Model for Real-World Radiology and Comprehensive Evaluation	-	-	RadVLCorpus*
PRISM	Pathology	PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology	-	CoCa	*
Med-Gemini	Medicine	Capabilities of Gemini Models in Medicine	-	Gemini	*
EchoCLIP	Cardiology	Vision-Language Foundation Model for Echocardiogram Interpretation	Github	CLIP	*
ChemDFM	Chemistry	ChemDFM: Dialogue Foundation Model for Chemistry	-	LLaMA	PubMed + USPTO
CheXagent	Radiology	CheXagent: Towards a Foundation Model for Chest X-Ray Interpretation	Github	BLIP-2	CheXinstruct*
SAT	Radiology	One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts	Github	-	SAT-DS*
PathChat	Pathology	A Foundational Multimodal Vision Language AI Assistant for Human Pathology	-	LLaVA	PathChatInstruct*
Qilin-Med-VL	Radiology	Qilin-Med-VL: Towards Chinese Large Vision-Language Model for General Healthcare	Github	LLaVA	Chi-Med-VL*
CXR-CLIP	Radiology	CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training	Github	CLIP	MIMIC-CXR + CheXpert + ChestX-ray14
MaCo	Radiology	Enhancing Representation in Radiography-Reports Foundation Model: A Granular Alignment Algorithm Using Masked Contrastive Learning	-	MAE + CLIP	MIMIC-CXR
PathLDM	Pathology	PathLDM: Text conditioned Latent Diffusion Model for Histopathology	Github	Latent Diffusion	TCGA-BRCA + GPT-3.5
RadFM	Radiology	Towards Generalist Foundation Model for Radiology	Github	-	MedMD*
KAD	Radiology	Knowledge-Enhanced Visual-Language Pre-Training on Chest Radiology Images	Github	CLIP	MIMIC-CXR + UMLS
Med-Flamingo	Medicine	Med-Flamingo: A Multimodal Medical Few-Shot Learner	Github	Flamingo	MTB + PMC-OA
CONCH	Pathology	A Visual-Language Foundation Model for Computational Pathology	Github	CoCa	PubMed + PMC
QuiltNet	Pathology	Quilt-1M: One Million Image-Text Pairs for Histopathology	Github	CLIP	Quilt-1M*
PathAsst	Pathology	PathAsst: Redefining Pathology through Generative Foundation AI Assistant for Pathology	Github	CLIP	PathCap + PathInstruct*
PLIP	Pathology	A Visual-Language Foundation Model for Pathology Image Analysis Using Medical Twitter	Huggingface	CLIP	OpenPath*
MI-Zero	Pathology	Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images	Github	CLIP	ARCH
LLaVA-Med	Biomedicine	LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day	Github	LLaVA	PMC-15M + GPT-4
MedVInT	Biomedicine	PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering	Github	-	PMC-VQA*
PMC-CLIP	Biomedicine	PMC-CLIP: Contrastive Language-Image Pre-Training Using Biomedical Documents	Github	CLIP	PMC-OA*
BiomedCLIP	Biomedicine	Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing	Huggingface	CLIP	PMC-15M*
MedKLIP	Radiology	MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training	Github	CLIP	MIMIC-CXR
MedCLIP	Medicine	MedCLIP: Contrastive Learning from Unpaired Medical Images and Text	Github	CLIP	CheXpert + MIMIC-CXR
CheXzero	Radiology	Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning	Github	CLIP	MIMIC-CXR
PubMedCLIP	Radiology	Does CLIP Benefit Visual Question Answering in the Medical Domain as Much as it Does in the General Domain?	Github	CLIP	ROCO

Protein and Molecule Models

Model	Subfield	Paper	Code	Base	Pre-Training Data
nach0	Molecules	nach0: Multimodal Natural and Chemical Languages Foundation Model	Github	T5	*
MoleculeSTM	Drug	Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing	Github	CLIP	PubChem
AlphaMissense	Proteomics	Accurate Proteome-Wide Missense Variant Effect Prediction with AlphaMissense	Github	AlphaFold	PDB + UniRef
GET	Genomics	GET: A Foundation Model of Transcription across Human Cell Types	Huggingface	Transformer	*
GIT-Mol	Molecules	GIT-Mol: A Multi-Modal Large Language Model for Molecular Science with Graph, Image, and Text	Github	T5 + BLIP-2	PubChem
ESM-2	Proteomics	Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model	Github	Transformer	UniRef
AlphaFold 2	Proteomics	Highly Accurate Protein Structure Prediction with AlphaFold	Github	-	PDB + Uniclust30

Other Models

Model	Subfield	Paper	Code	Base	Pre-Training Data
OmniNA	Nucleotide sequence	OmniNA: A Foundation Model for Nucleotide Sequences	-	LLaMA	NCBI
LaBraM	EEG	Large Brain Model for Learning Generic Representations with Tremendous EEG Data in BCI	-	Transformer	*
Neuro-GPT	EEG	Neuro-GPT: Developing A Foundation Model for EEG	-	-	TUH EEG

Datasets for Foundation Model

Text

Dataset (Paper)	Description	Link
MedBench (arXiv)	A Chinese medical LLM benchmark with 300,901 Chinese questions covering 43 clinical specialties, combined with an automatic evaluation system	Official site
MMedBench (arXiv)	A multilingual medical QA benchmark, where questions are categorized into 21 topics	Github
MMedC (arXiv)	A multilingual medical corpus containing over 25.5B tokens	Github
BiMed1.3M (arXiv)	An English and Arabic bilingual dataset of 1.3M samples of medical QA and chat	Github
GAP-Replay (arXiv)	48.1B tokens from 4 medical corpora including guidelines, abstracts, papers, and replay	Github
Huatuo-26M (arXiv)	26M Chinese medical QA pairs	Github
Medical Meadow (arXiv)	16M medical QA pairs collected from 9 sources	Github
MultiMedQA (Nature)	6 existing and 1 online-collected medical QA datasets	Nature
BigBio (Nature)	126+ biomedical NLP datasets covering 13 task categories and 10+ languages	Github
MedMCQA (MLR)	194K multiple-choice questions covering 2.4K healthcare topics	Official site
MedQA-USMLE (MDPI)	61,097 multiple-choice questions based on USMLE in three languages	Github
CBLUE (arXiv)	A Chinese biomedical language understanding evaluation benchmark with 18 datasets	Official site
BLURB (arXiv)	13 biomedical NLP datasets in 6 tasks	Official site
PubMedQA (arXiv)	1K expert-annotated, 61.2K unlabeled, and 211.3K artificially generated biomedical QA instances	Official site
BLUE (arXiv)	5 language tasks with 10 biomedical and clinical text datasets	Github
webMedQA (BMC)	63,284 real-world Chinese medical questions with over 300K answers	Github
MedMentions (arXiv)	4,392 papers annotated by experts with mentions of UMLS entities	Github
MIMIC-III (Nature)	Critical care data for over 40,000 patients	Official site
ClinicalTrials.gov	An online database of clinical research studies, including clinical trials and observational studies	Official site

Imaging

Dataset (Paper)	Description	Link
Mass-100K (arXiv)	100M tissue patches from 100,426 diagnostic H&E WSIs across 20 major tissue types	-
RETFound (Nature)	Unannotated retinal images, containing 904,170 CFPs and 736,442 OCT scans	Nature
AbdomenAtlas-8K (arXiv)	8,448 CT volumes with per-voxel annotated eight abdominal organs	Github
Med-MNIST v2 (Nature)	12 2D and 6 3D datasets for biomedical image classification	Official site
EchoNet-Dynamic (Nature)	10,030 expert-annotated echocardiogram videos	Official site
CheXpert (arXiv)	224,316 chest radiographs of 65,240 patients	Official site
Kather Colon Dataset (PMC)	100K histological images of human colorectal cancer and healthy tissue	Zenodo
DeepLesion (PMC)	32K CT scans with annotations and semantic labels from radiological reports	NIH
ChestXray-NIHCC (arXiv)	100K radiographs with labels from more than 30,000 patients	NIH
ISIC	An archive containing 23K skin lesion images with labels	Official site

Genomics

Dataset (Paper)	Description	Link
1000 Genomes Project (Nature)	A comprehensive catalog of human genetic variations	Official site
ENCODE (Nature)	A platform of genomics data and encyclopedia with integrative-level and ground-level annotations	NIH
dbSNP (NIH)	A collection of human single nucleotide variations, microsatellites, and small-scale insertions and deletions	NIH

Drug

Dataset (Paper)	Description	Link
DrugChat (arXiv)	143,517 question-answer pairs covering 10,834 drug compounds, collected from PubChem and ChEMBL	Github
PubChem (NIH)	A collection of 900+ sources of chemical information data	NIH
DrugBank (NIH)	A web-enabled structured database of molecular information about drugs	Official site
ChEMBL (NIH)	20M bioactivity measurements for 2.4M distinct compounds and 15K protein targets	Official site

Multi-Modal

Dataset (Paper)	Description	Link
RadGenome-Chest CT (arXiv)	A dataset of 3D chest CT, including 197 organ-level segmentation masks, 665K multi-granularity grounded reports, and 1.3M grounded VQA pairs	-
OmniMedVQA (arXiv)	131,813 question-answering items with 120,530 images from 12 modalities and 26 human anatomical regions, collected from 75 medical datasets	-
SAT-DS (arXiv)	11,462 scans with 142,254 segmentation annotations spanning 8 human body regions from 31 medical image segmentation datasets, together with domain knowledge from e-Anatomy and UMLS	Github
PathChatInstruct (arXiv)	257,004 instructions of pathology-specific queries with image and text	-
Chi-Med-VL (arXiv)	580,014 image-text pairs and 469,441 question-answer pairs for general healthcare in Chinese	Github
MedMD (arXiv)	15.5M 2D scans and 180K 3D radiology scans with textual descriptions	Github
OpenPath (Nature)	208,414 pathology images paired with natural language descriptions	Huggingface
Quilt-1M (arXiv)	1M image-text pairs for histopathology	Github
Med-MMHL (arXiv)	Human- and LLM-generated misinformation detection dataset	Github
Mol-Instructions (arXiv)	148K molecule-oriented, 505K protein-oriented, and biomolecular text instructions	Huggingface
PathInstruct (arXiv)	180K samples of LLM-generated instruction-following data	Github
PMC-VQA (arXiv)	227K VQA pairs of 149K images of various modalities or diseases	Github
PMC-OA (arXiv)	1.6M fine-grained biomedical image-text pairs	Github
PathCap (arXiv)	142K pathology image-caption pairs from various sources	Github
SwissProtCLAP (arXiv)	441K text-protein sequence pairs	Github
MIMIC-IV (Nature)	Clinical information for hospital stays of over 60,000 patients	Official site
MIMIC-CXR (Nature)	227,835 chest imaging studies with free-text reports for 65,379 patients	PhysioNet
TCGA	A landmark cancer genomics program that molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types	Official site
