DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License

Request for guidance on preprocessing PDB files for model input #18

Closed stau-7001 closed 1 year ago

stau-7001 commented 1 year ago

Hello, I have come across your fascinating GitHub repository on protein structure pre-training, and I am excited to explore its potential for my own research. I noticed that the provided data is in HDF5 format, and there is no preprocessing code available for PDB files. I would like to use my own PDB files for inference with your model, but I am unsure how to preprocess them to match the expected input format.

Would you be able to provide some guidance or share a sample preprocessing script for converting PDB files to the required HDF5 format? This would greatly help me and other researchers who are interested in utilizing your work for various applications.

Thank you for your time and for sharing your valuable work with the community. I am looking forward to your response and any assistance you can provide.

Oxer11 commented 1 year ago

Hi! Thanks for your interest in our work!

We only use the HDF5 format for storing proteins in the Fold3D dataset, following the original dataset in IEConv. I don't have preprocessing code for converting PDB files to HDF5 format either.

For the EC and GO datasets, we still use PDB files as the default storage format. The code for loading PDB files can be found in the TorchProtein docs. I suggest checking the documentation and tutorials on TorchProtein before using this repo.

Although we only provide the Fold3D dataset in HDF5 format, running the GearNet_IEConv model on PDB-format data is no problem. You just need to load the files with the load_pdbs() method and then feed the resulting dataset into the model as usual. Our dataset module will transform these files into data.Protein objects.
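
As a rough illustration of the load_pdbs() route, here is a minimal sketch rather than code from this repo. It assumes TorchDrug is installed and that `data.ProteinDataset` exposes `load_pdbs()` and reads per-sample labels from a `targets` dictionary, as described in the TorchProtein docs. The names `collect_pdb_files`, `build_pdb_dataset` and `CustomPDBDataset` are hypothetical helpers introduced only for this example.

```python
from pathlib import Path


def collect_pdb_files(directory):
    """Return the sorted paths of all .pdb files directly inside `directory`."""
    return sorted(str(path) for path in Path(directory).glob("*.pdb"))


def build_pdb_dataset(pdb_files, **kwargs):
    """Wrap a list of PDB paths in a TorchDrug protein dataset.

    TorchDrug is imported lazily so that collect_pdb_files() works without it.
    Extra keyword arguments are passed through to load_pdbs().
    """
    from torchdrug import data

    class CustomPDBDataset(data.ProteinDataset):  # hypothetical class name
        def __init__(self, pdb_files, **kwargs):
            # load_pdbs() parses each file into a data.Protein object.
            self.load_pdbs(pdb_files, **kwargs)
            # Assumption: inference needs no labels, so the per-sample
            # target dictionary is left empty.
            self.targets = {}

    return CustomPDBDataset(pdb_files, **kwargs)
```

Usage would look something like `dataset = build_pdb_dataset(collect_pdb_files("my_pdbs"))`, where `my_pdbs` is a placeholder directory. Any graph construction transform expected by the model would still need to be applied as in the repo's configs.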

Hope this helps.