
OpenBioMed


Introduction

This repository holds OpenBioMed, a Python deep learning toolkit for AI-empowered biomedicine. OpenBioMed provides easy access to multimodal biomedical data, including molecular structures, transcriptomics, knowledge graphs, and biomedical texts for molecules, proteins, and single cells. OpenBioMed supports a wide range of downstream applications, from traditional AI drug discovery tasks to newly emerged multimodal challenges.
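To make the "multimodal" framing concrete, a record typically pairs a molecular structure (as a SMILES string) with a text description and property labels. The sketch below is illustrative only; `MoleculeRecord` and its fields are hypothetical names, not OpenBioMed's actual data classes:

```python
from dataclasses import dataclass, field

@dataclass
class MoleculeRecord:
    """Hypothetical multimodal record: structure + text + property labels."""
    smiles: str                                      # molecular structure as a SMILES string
    text: str = ""                                   # free-text description (e.g. ChEBI-20 style)
    properties: dict = field(default_factory=dict)   # labels, e.g. MoleculeNet-style targets

# Example: aspirin with a short description and one (illustrative) property
aspirin = MoleculeRecord(
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    text="Aspirin is a salicylate used to treat pain and fever.",
    properties={"logP": 1.19},
)
```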

OpenBioMed provides researchers with easy-to-use APIs for processing these data modalities and for training and evaluating models on the tasks below.

The following table shows the supported tasks, datasets and models in OpenBioMed. This is a continuing effort and we are working on further growing the list.

| Task | Supported Datasets | Supported Models |
| --- | --- | --- |
| Cross-modal Retrieval | PCdes | KV-PLM, SciBERT, MoMu, GraphMVP, MolFM |
| Molecule Captioning | ChEBI-20 | MolT5, MoMu, GraphMVP, MolFM, BioMedGPT |
| Text-based Molecule Generation | ChEBI-20 | MolT5, SciBERT, MoMu, MolFM |
| Molecule Question Answering | ChEMBL-QA | MolT5, MolFM, BioMedGPT |
| Protein Question Answering | UniProtQA | BioMedGPT |
| Cell Type Classification | Zheng68k, Baron | scBERT, CellLM |
| Single Cell Drug Response Prediction | GDSC | DeepCDR, TGSA, CellLM |
| Molecule Property Prediction | MoleculeNet | MolCLR, GraphMVP, MolFM, DeepEIK, BioMedGPT |
| Drug-target Binding Affinity Prediction | Yamanishi08, BMKG-DTI, DAVIS, KIBA | DeepDTA, MGraphDTA, DeepEIK |
| Protein-protein Interaction Prediction | SHS27k, SHS148k, STRING | PIPR, GNN-PPI, OntoProtein |

Installation

1. (Optional) Create a conda environment:

```shell
conda create -n OpenBioMed python=3.9
conda activate OpenBioMed
```

2. Install the required packages:

```shell
pip install -r requirements.txt
```

3. Install the PyG dependencies:

```shell
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-(your_torch_version)+(your_cuda_version).html
pip install torch-geometric
# If you have issues installing the above PyTorch-related packages, the instructions at
# https://pytorch.org/get-started/locally/ and https://github.com/pyg-team/pytorch_geometric may help.
# You may find it convenient to install PyTorch Geometric and its extensions directly from the
# wheels available at https://data.pyg.org/whl/.
```

Note: additional packages may be required for some downstream tasks.
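The wheel index URL in step 3 follows a fixed pattern: the PyTorch version plus a CUDA suffix (`cu118` for CUDA 11.8, `cpu` for CPU-only builds). A small helper can fill the placeholders from your installed versions; the URL pattern below is assumed from the command above, so verify the resulting page exists for your version combination:

```python
from typing import Optional

def pyg_wheel_index(torch_version: str, cuda_version: Optional[str]) -> str:
    """Build the PyG wheel index URL for the -f flag of pip.

    torch_version: e.g. "2.1.0" (strip any local suffix such as "+cu118")
    cuda_version:  e.g. "11.8", or None for a CPU-only build
    """
    suffix = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://data.pyg.org/whl/torch-{torch_version}+{suffix}.html"
```

With PyTorch installed, `pyg_wheel_index(torch.__version__.split("+")[0], torch.version.cuda)` produces the URL to pass after `-f`.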

Quick Start

Check out our Jupyter notebooks and documentation for a quick start!

| Name | Description |
| --- | --- |
| BioMedGPT-10B Inference | Example of using BioMedGPT-10B to answer questions about molecules and proteins. |
| Cross-modal Retrieval with MolFM | Example of using MolFM to retrieve the most relevant text descriptions for a molecule. |
| Text-based Molecule Generation with MolT5 | Example of using MolT5 to generate the SMILES string of a molecule from a text description. |
| Cell Type Classification with CellLM | Example of using a fine-tuned CellLM to classify cell types. |
| Molecule Property Prediction | Training & testing pipeline for the molecule property prediction task. |
| Drug-response Prediction | Training & testing pipeline for the drug-response prediction task. |
| Drug-target Binding Affinity Prediction | Training & testing pipeline for the drug-target binding affinity prediction task. |
| Molecule Captioning | Training & testing pipeline for the molecule captioning task. |
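The training & testing pipelines above share a common shape: split a dataset, fit a model on the training portion, and evaluate on the held-out portion. The stdlib-only sketch below shows that shape with a toy mean-predictor baseline; every name in it is illustrative and none of it is OpenBioMed's actual API:

```python
import random

def train_test_split(data, test_ratio=0.2, seed=0):
    """Shuffle deterministically and split into train/test portions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_ratio))
    return data[:cut], data[cut:]

class MeanRegressor:
    """Toy baseline: predicts the mean training label for every input."""
    def fit(self, pairs):
        labels = [y for _, y in pairs]
        self.mean = sum(labels) / len(labels)
        return self
    def predict(self, x):
        return self.mean

def evaluate_mse(model, pairs):
    """Mean squared error of the model over (input, label) pairs."""
    return sum((model.predict(x) - y) ** 2 for x, y in pairs) / len(pairs)

# Toy dataset of (SMILES, property value) pairs
data = [("C", 0.5), ("CC", 1.0), ("CCC", 1.5), ("CCCC", 2.0), ("CCCCC", 2.5)]
train, test = train_test_split(data)
model = MeanRegressor().fit(train)
mse = evaluate_mse(model, test)
```

In the real notebooks, the dataset, model, and metric are replaced by the task-specific components listed in the table (e.g. MoleculeNet data with a GraphMVP model for property prediction).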

Limitations

This repository holds BioMedGPT-LM-7B and BioMedGPT-10B, and we emphasize the responsible and ethical use of these models. BioMedGPT should NOT be used to provide services to the general public. It is strictly prohibited to generate any content that violates applicable laws and regulations, such as content inciting subversion of state power, endangering national security and interests, or propagating terrorism, extremism, ethnic hatred and discrimination, violence, pornography, or false and harmful information. We are not liable for any consequences arising from any content, data, or information provided or published by users.

License

This repository is licensed under the MIT License. Use of the BioMedGPT-LM-7B and BioMedGPT-10B models is additionally governed by the accompanying Acceptable Use Policy.

Contact Us

We look forward to user feedback to help us improve the framework. If you have technical questions or suggestions, please feel free to open an issue. For commercial support or collaboration, please contact opensource@pharmolix.com.

Cite Us

If you find our open-sourced code and models helpful to your research, please consider giving this repository a 🌟star and 📎citing the following articles. Thank you for your support!

To cite OpenBioMed:
@misc{OpenBioMed_code,
      author={Luo, Yizhen and Yang, Kai and Hong, Massimo and Liu, Xing Yi and Zhao, Suyuan and Zhang, Jiahuan and Wu, Yushuai and Nie, Zaiqing},
      title={Code of OpenBioMed},
      year={2023},
      howpublished={\url{https://github.com/BioFM/OpenBioMed.git}}
}
To cite BioMedGPT:
@misc{luo2023biomedgpt,
      title={BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine}, 
      author={Yizhen Luo and Jiahuan Zhang and Siqi Fan and Kai Yang and Yushuai Wu and Mu Qiao and Zaiqing Nie},
      year={2023},
      eprint={2308.09442},
      archivePrefix={arXiv},
      primaryClass={cs.CE}
}
To cite DeepEIK:
@misc{luo2023empowering,
      title={Empowering AI drug discovery with explicit and implicit knowledge}, 
      author={Yizhen Luo and Kui Huang and Massimo Hong and Kai Yang and Jiahuan Zhang and Yushuai Wu and Zaiqing Nie},
      year={2023},
      eprint={2305.01523},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
To cite MolFM:
@misc{luo2023molfm,
      title={MolFM: A Multimodal Molecular Foundation Model}, 
      author={Yizhen Luo and Kai Yang and Massimo Hong and Xing Yi Liu and Zaiqing Nie},
      year={2023},
      eprint={2307.09484},
      archivePrefix={arXiv},
      primaryClass={q-bio.BM}
}
To cite CellLM:
@misc{zhao2023largescale,
      title={Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning}, 
      author={Suyuan Zhao and Jiahuan Zhang and Zaiqing Nie},
      year={2023},
      eprint={2306.04371},
      archivePrefix={arXiv},
      primaryClass={cs.CE}
}
To cite LangCell:
@misc{zhao2024langcell,
      title={LangCell: Language-Cell Pre-training for Cell Identity Understanding}, 
      author={Suyuan Zhao and Jiahuan Zhang and Yizhen Luo and Yushuai Wu and Zaiqing Nie},
      year={2024},
      eprint={2405.06708},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}