Closed dongli96 closed 1 year ago
Yes, your understanding is correct. We pre-train our model on single-chain protein databases. The Enzyme Commission dataset is taken from the DeepFRI paper (https://github.com/flatironinstitute/DeepFRI). It is constructed from the SIFTS database, which maps annotations to individual PDB chains, so all the proteins in the EC dataset are single-chain.
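For readers who want to see what chain-level samples look like in practice, here is a minimal sketch (not the authors' or DeepFRI's preprocessing code) that groups the ATOM records of a PDB-format string by chain. The function name `split_atoms_by_chain` and the `PDBID-CHAIN` key format are assumptions made for illustration only.

```python
from collections import defaultdict


def split_atoms_by_chain(pdb_text, pdb_id):
    """Group ATOM records of a PDB-format string by chain identifier.

    Hypothetical helper: each chain becomes its own sample, keyed as
    "<pdb_id>-<chain_id>" (an assumed naming scheme).
    """
    chains = defaultdict(list)
    for line in pdb_text.splitlines():
        # In the PDB format, column 22 (index 21) holds the chain identifier.
        if line.startswith("ATOM") and len(line) > 21:
            chain_id = line[21]
            chains[f"{pdb_id}-{chain_id}"].append(line)
    return dict(chains)


# Toy two-chain entry; the ID "1ABC" and the records are made up.
example = "\n".join([
    "ATOM      1  CA  ALA A   1",
    "ATOM      2  CA  GLY A   2",
    "ATOM      3  CA  SER B   1",
])
samples = split_atoms_by_chain(example, "1ABC")
```

Under this scheme, a multi-chain enzyme contributes one sample per chain, and each chain carries its own annotation rather than one label for the whole complex.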
Your explanation cleared up my confusion very well. Thank you.
For proteins with multiple chains, did you split them by chain and input the splits into the model one by one, or directly input the whole proteins?
In the section "F ADDITIONAL EXPERIMENTAL RESULTS ON EC AND GO PREDICTION - Pretraining on different datasets" of your paper, you wrote:
which seems to imply that you trained the model on a collection of single protein chains. However, you also ran experiments on Enzyme Commission number prediction. To my knowledge, many enzymes contain more than one chain. It does not seem feasible to split an enzyme into its chains and input them into the model separately, since that would hardly predict the enzyme type correctly.