DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License
Dealing with proteins with multiple chains #17

Closed dongli96 closed 1 year ago

dongli96 commented 1 year ago

For proteins with multiple chains, did you split them by chain and input the splits into the model one by one, or directly input the whole proteins?

In the section "F ADDITIONAL EXPERIMENTAL RESULTS ON EC AND GO PREDICTION - Pretraining on different datasets" of your paper, you wrote:

Specifically, we extract 123,505 experimentally-determined protein structures from PDB whose resolutions are between 0.0 and 2.5 angstroms, and we further extract 305,265 chains from these proteins to construct the final dataset

which seems to imply that you trained the model on single protein chains. However, you also ran experiments on Enzyme Commission number prediction. To my knowledge, many enzymes contain more than one chain. If such an enzyme were split into separate chains and each chain were input into the model on its own, the model could hardly predict the enzyme type correctly.

Oxer11 commented 1 year ago

Yes, your understanding is correct. We pre-train our model on single-chain protein databases. The Enzyme Commission dataset is taken from the DeepFRI paper (https://github.com/flatironinstitute/DeepFRI). It is constructed from the SIFTS database, which maps annotations to PDB chains, so all the proteins in the EC dataset are single-chain.
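
For readers who want to see what chain-level splitting looks like in practice, here is a minimal sketch. It is not the GearNet or DeepFRI preprocessing code. It assumes plain PDB-format text with fixed columns, where the chain identifier sits in column 22 of each ATOM/HETATM record. The function name `split_pdb_by_chain` and the three-atom example are hypothetical.

```python
from collections import defaultdict


def split_pdb_by_chain(pdb_text):
    """Group ATOM/HETATM records of a PDB-format string by chain ID.

    Assumption: fixed-column PDB format, chain identifier in column 22
    (0-based index 21). Hypothetical helper, not part of GearNet.
    """
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains[line[21]].append(line)
    return dict(chains)


# Tiny made-up example with two chains (A and B).
example = "\n".join([
    "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C",
    "ATOM      2  CA  GLY A   2      12.560   7.001  -5.900  1.00  0.00           C",
    "ATOM      3  CA  SER B   1      20.000  10.000   3.000  1.00  0.00           C",
])

chains = split_pdb_by_chain(example)
for chain_id, records in sorted(chains.items()):
    print(chain_id, len(records))
```

Each value in the returned dictionary holds the records of one chain, which could then be written out and treated as a separate single-chain input.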

dongli96 commented 1 year ago

Your explanation cleared up my confusion very well. Thank you.