About CATH dataset pretraining

LirongWu / MAPE-PPI

Code for ICLR 2024 (Spotlight) paper "MAPE-PPI: Towards Effective and Efficient Protein-Protein Interaction Prediction via Microenvironment-Aware Protein Embedding"

MIT License


About CATH dataset pretraining #7

Open yuanhu246 opened 4 months ago

yuanhu246 commented 4 months ago

Dear author, you provide a pre-trained model on GitHub. On which dataset was this model pre-trained? In your paper, you mention using the CATH dataset for pre-training. I think it is an interesting dataset, but I am new to bioinformatics and am not familiar with CATH. How can I download the CATH dataset and use it with your model? Any advice would be appreciated.

LirongWu commented 3 months ago

It is pre-trained on CATH, a dataset that contains both protein sequences and real structures.

To pre-train with customized data (e.g., CATH or AlphaFoldDB datasets), you can refer to the steps described in the README.

Download the CATH dataset from the official website (https://www.cathdb.info/).
Pre-process pre-training PDB files as done in ./raw_data/data_process.py and transform into three files:
Load pre-processed data and perform pretraining on it.
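
As a rough illustration of what the pre-processing step involves, the sketch below reads the C-alpha ATOM records of a PDB file and returns each chain's sequence and coordinates. It is not the repository's ./raw_data/data_process.py, and it does not produce the files that script writes. The function names parse_pdb_lines and parse_pdb_file and the output layout are assumptions made for illustration; only the fixed-column layout of ATOM records comes from the PDB format itself.

```python
from collections import OrderedDict

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def parse_pdb_lines(lines):
    """Return {chain_id: (sequence, [(x, y, z), ...])} from C-alpha ATOM records.

    Hypothetical helper for illustration; not part of MAPE-PPI.
    """
    chains = OrderedDict()
    seen = set()
    for line in lines:
        if line.startswith("ENDMDL"):
            break  # keep only the first model of a multi-model file
        if not line.startswith("ATOM"):
            continue  # skips HETATM, so calcium ions named "CA" are ignored
        if line[12:16].strip() != "CA":
            continue  # atom name, columns 13-16
        if line[16] not in (" ", "A"):
            continue  # alternate location, column 17: keep blank or "A" only
        res_name = line[17:20].strip()   # residue name, columns 18-20
        chain_id = line[21]              # chain identifier, column 22
        res_key = (chain_id, line[22:27])  # residue number + insertion code
        if res_key in seen:
            continue
        seen.add(res_key)
        # x, y, z coordinates, columns 31-38, 39-46, 47-54
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        seq, coords = chains.setdefault(chain_id, ([], []))
        seq.append(THREE_TO_ONE.get(res_name, "X"))  # "X" for non-standard residues
        coords.append(xyz)
    return {c: ("".join(s), xyz) for c, (s, xyz) in chains.items()}


def parse_pdb_file(path):
    """Read a PDB file from disk and parse it with parse_pdb_lines."""
    with open(path) as handle:
        return parse_pdb_lines(handle)
```

Calling parse_pdb_file on a downloaded file (the path is whatever you saved it as) and checking that every chain has a non-empty sequence with one coordinate triple per residue is a quick sanity check that the download is a usable structure file.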

yuanhu246 commented 3 months ago

Thank you very much for your help. I have now downloaded the PDB file. Could you please help me check whether the PDB file I downloaded is correct? Also, where can I download the corresponding protein.{}.sequences.dictionary.csv and protein.actions.{}.txt files?