DeepGraphLearning / GearNet

GearNet and Geometric Pretraining Methods for Protein Structure Representation Learning, ICLR'2023 (https://arxiv.org/abs/2203.06125)
MIT License

Pre-training on different datasets #48

Closed DuanhaoranCC closed 1 year ago

DuanhaoranCC commented 1 year ago

Hello, you discussed the results of pre-training on different datasets in the appendix. As Table 8 shows, performance is comparable whether pre-training on real PDB structures or on AlphaFold (v1 or v2) predicted structures, even though real PDB has only 300,000 structures while AlphaFold has 800,000. Why did you use the larger AlphaFold set in the main text? Also, in theory, the larger the pre-training dataset, the better the results; why does Table 8 not show this?

Oxer11 commented 1 year ago

Hi, thanks for the question.

I use AlphaFoldDB for pre-training, since it contains more available structures than PDB. We agree that "the larger the dataset, the better the results" holds when the number of structures increases by a factor of about 100. It may not hold when the dataset is only about twice as large. Also, the pre-training results depend on the model capacity.
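For anyone who wants to try the same comparison, here is a minimal sketch of how one might load the AlphaFoldDB pre-training corpus through TorchDrug (which GearNet builds on). The path and the `species_id`/`split_id` values are placeholders for illustration, not the exact setting from the paper:

```python
from torchdrug import datasets, transforms

# Residue-level view with truncation, a common preprocessing choice
# for structure-based pre-training.
transform = transforms.Compose([
    transforms.TruncateProtein(max_length=350, random=True),
    transforms.ProteinView(view="residue"),
])

# AlphaFoldDB predicted structures; species_id / split_id select which
# proteome shard to download (placeholder values, adjust as needed).
dataset = datasets.AlphaFoldDB(
    "~/protein-datasets/alphafold",
    species_id=0,
    split_id=0,
    atom_feature=None,
    bond_feature=None,
    transform=transform,
)
# To pre-train on experimental structures instead, swap in a PDB-derived
# TorchDrug dataset class here with the same transform.
```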