bowen-gao / DrugCLIP

[NeurIPS 2023] DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening
MIT License

DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening

License: MIT ArXiv


Official code for the paper "DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening", accepted at Neural Information Processing Systems (NeurIPS), 2023. The code is currently a raw version and will be updated as soon as possible. If you have any inquiries, feel free to contact billgao0111@gmail.com.

Requirements

Same as Uni-Mol.

The RDKit version should be 2022.9.5.
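
A quick way to confirm the pinned RDKit version is sketched below. It is only an illustration: the helper names `version_tuple` and `rdkit_matches` are hypothetical and not part of this repository. The comparison ignores zero-padding, since an installed RDKit may report its version as `2022.09.5` rather than `2022.9.5`.

```python
import re

REQUIRED = "2022.9.5"  # version stated in this README


def version_tuple(version):
    """Turn '2022.09.5' or '2022.9.5' into (2022, 9, 5), ignoring zero-padding."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def rdkit_matches(installed, required=REQUIRED):
    """Return True when the installed version equals the required one."""
    return version_tuple(installed) == version_tuple(required)


if __name__ == "__main__":
    try:
        import rdkit
    except ImportError:
        print("RDKit is not installed")
    else:
        status = "OK" if rdkit_matches(rdkit.__version__) else "expected " + REQUIRED
        print(f"RDKit {rdkit.__version__}: {status}")
```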

Data and checkpoints

https://drive.google.com/drive/folders/1zW1MGpgunynFxTKXC2Q4RgWxZmg6CInV?usp=sharing

It currently includes the training data, the trained checkpoint, and the test data for DUD-E.

Training data

The training dataset is included in the Google Drive folder as train_no_test_af.zip. It contains the following files:


dict_pkt.txt: dictionary for pocket atom types

dict_mol.txt: dictionary for molecule atom types

train.lmdb: train dataset

valid.lmdb: validation dataset

Use py_scripts/lmdb_utils.py to read the lmdb file. The keys in the lmdb files and corresponding descriptions are shown below:


"atoms": "atom types for each atom in the ligand" 

"coordinates": "3D coordinates for each atom in the ligand generated by RDKit. Max number of conformations is 10"

"pocket_atoms": "atom types for each atom in the pocket"

"pocket_coordinates": "3D coordinates for each atom in the pocket"

"mol": "RDKit molecule object for the ligand"

"smi": "SMILES string for the ligand"

"pocket": "PDB ID of the pocket"

The dataset is compiled from the PDBbind dataset. It combines authentic protein-ligand complexes with complexes generated through HomoAug, a technique for augmenting data with homology-based transformations.
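
The record keys listed above can be inspected with a few lines of Python. The sketch below is not the repository's reader (py_scripts/lmdb_utils.py is the reference). It assumes that each LMDB value is a pickled dict, as in Uni-Mol-style datasets, and that "coordinates" holds one entry per conformation. The function names `iter_lmdb_records` and `summarize_record` are hypothetical.

```python
import pickle


def iter_lmdb_records(path):
    """Yield (key, record) pairs from an LMDB file.

    Assumption: every value is a pickled dict. Only unpickle files
    from a source you trust.
    """
    import lmdb  # imported lazily so summarize_record works without it

    env = lmdb.open(path, subdir=False, readonly=True, lock=False,
                    readahead=False, meminit=False)
    try:
        with env.begin() as txn:
            for key, value in txn.cursor():
                yield key, pickle.loads(value)
    finally:
        env.close()


def summarize_record(record):
    """Report basic sizes for one record, using the keys documented above."""
    return {
        "pocket": record.get("pocket"),
        "smi": record.get("smi"),
        "n_ligand_atoms": len(record.get("atoms", [])),
        # Assumption: one entry per conformation (at most 10).
        "n_conformations": len(record.get("coordinates", [])),
        "n_pocket_atoms": len(record.get("pocket_atoms", [])),
    }
```

For example, `for key, rec in iter_lmdb_records("train.lmdb"): print(summarize_record(rec)); break` would print a summary of the first record, provided the file follows the assumed layout.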

Test data

DUD-E

DUD-E
├── gene id
│   ├── receptor.pdb
│   ├── crystal_ligand.mol2
│   ├── actives_final.ism
│   ├── decoys_final.ism
│   ├── mols.lmdb (containing all actives and decoys)
│   ├── pocket.lmdb
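
Before running the test script, it can help to verify that every target directory follows the layout above. The following is a minimal sketch using only the file names shown in the tree. The function name `find_missing_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names taken from the DUD-E directory tree above.
EXPECTED_FILES = (
    "receptor.pdb",
    "crystal_ligand.mol2",
    "actives_final.ism",
    "decoys_final.ism",
    "mols.lmdb",
    "pocket.lmdb",
)


def find_missing_files(dude_root):
    """Map each target directory name to the expected files it lacks.

    Targets with a complete set of files are left out of the result.
    """
    missing = {}
    for target_dir in sorted(p for p in Path(dude_root).iterdir() if p.is_dir()):
        absent = [name for name in EXPECTED_FILES
                  if not (target_dir / name).is_file()]
        if absent:
            missing[target_dir.name] = absent
    return missing
```

An empty dict returned by `find_missing_files("DUD-E")` means every target directory contains all six files.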

PCBA

lit_pcba
├── target name
│   ├── PDBID_protein.mol2
│   ├── PDBID_ligand.mol2
│   ├── actives.smi
│   ├── inactives.smi
│   ├── mols.lmdb (containing all actives and inactives)
│   ├── pocket.lmdb
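
To get a feel for the class balance of a target, the actives and inactives can be counted directly from the two .smi files in the tree above. This sketch assumes one molecule per non-empty line, which is the usual convention for SMILES files but is not stated in this README. The function names are hypothetical.

```python
from pathlib import Path


def count_molecules(smi_path):
    """Count non-empty lines, assuming one molecule per line."""
    with open(smi_path) as handle:
        return sum(1 for line in handle if line.strip())


def active_fraction(target_dir):
    """Return (n_actives, n_inactives, fraction of actives) for one target."""
    target = Path(target_dir)
    n_actives = count_molecules(target / "actives.smi")
    n_inactives = count_molecules(target / "inactives.smi")
    total = n_actives + n_inactives
    fraction = n_actives / total if total else 0.0
    return n_actives, n_inactives, fraction
```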

Data preprocessing

See py_scripts/write_dude_multi.py.

HomoAug

Please refer to the HomoAug directory for details.

Train

bash drugclip.sh

Test

bash test.sh

Retrieval

bash retrieval.sh

In the Google Drive folder, example files for pocket.lmdb and mols.lmdb are provided under the retrieval directory.

Citation

If you find our work useful, please cite our paper:

@inproceedings{gao2023drugclip,
    author = {Gao, Bowen and Qiang, Bo and Tan, Haichuan and Jia, Yinjun and Ren, Minsi and Lu, Minsi and Liu, Jingjing and Ma, Wei-Ying and Lan, Yanyan},
    title = {DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening},
    booktitle = {NeurIPS 2023},
    year = {2023},
    url = {https://openreview.net/forum?id=lAbCgNcxm7},
}