DeepGraphLearning / torchdrug

A powerful and flexible machine learning platform for drug discovery
https://torchdrug.ai/
Apache License 2.0

Pretrained Molecular Representations - Training GIN prior to passing it to InfoGraph #71

Open vladimirkovacevic opened 2 years ago

vladimirkovacevic commented 2 years ago

In the Pretrained Molecular Representations tutorial, the GIN model is passed to InfoGraph: `model = models.InfoGraph(gin_model, separate_model=False)`

Should GIN be trained first, then passed to InfoGraph?

Oxer11 commented 2 years ago

Hi! You don't need to train the GIN first, since InfoGraph itself defines a pretraining task. We wrap it as a 'model' instead of a 'task' in TorchDrug to facilitate interaction with other layers.
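To illustrate the pattern in plain Python (a toy sketch, NOT the real InfoGraph mutual-information objective and not TorchDrug code): the wrapper owns the randomly initialized encoder and defines the training loss, so optimizing the wrapper's loss is what trains the encoder from scratch.

```python
import random

random.seed(0)

class Encoder:
    """Stand-in for GIN: starts with random weights, no pretraining needed."""
    def __init__(self, dim):
        self.weights = [random.gauss(0, 1) for _ in range(dim)]

    def forward(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

class SelfSupervisedWrapper:
    """Stand-in for InfoGraph: wraps the encoder and defines the loss,
    so optimizing the wrapper updates the encoder's weights."""
    def __init__(self, encoder):
        self.encoder = encoder

    def loss_and_grad(self, x):
        # Toy surrogate objective (the real InfoGraph loss is a
        # mutual-information estimate between graph and node embeddings):
        # push the embedding toward a fixed target.
        target = 1.0
        err = self.encoder.forward(x) - target
        grad = [2 * err * xi for xi in x]  # d(err^2)/dw
        return err * err, grad

encoder = Encoder(dim=3)            # random weights, like a fresh GIN
wrapper = SelfSupervisedWrapper(encoder)
x = [0.5, -0.2, 0.3]

before = wrapper.loss_and_grad(x)[0]
for _ in range(50):  # plain gradient descent on the wrapper's loss
    _, grad = wrapper.loss_and_grad(x)
    encoder.weights = [w - 0.1 * g for w, g in zip(encoder.weights, grad)]
after = wrapper.loss_and_grad(x)[0]

print(before > after)  # prints True: the encoder was trained through the wrapper
```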

vladimirkovacevic commented 2 years ago

Thank you for the answer. I assumed that, but how exactly are the "pretrained" weights obtained, since the "pretrain" parameter is passed only when loading the dataset and not to the model? `dataset = datasets.ClinTox("~/molecule-datasets/", node_feature="pretrain", edge_feature="pretrain")`

"pretrain" argument results in invoking features.atom.pretrain R function for calculating molecular node features in molecule.py.

KiddoZhu commented 2 years ago

Hi! The arguments in the dataset refer to chemical features (e.g., atomic number, formal charge), rather than anything computed by a neural network. `pretrain` denotes a specific combination of chemical features that is suggested for pretraining graph neural networks.

You may use other chemical feature specifiers, such as `default`, for pretraining. Note that you need to keep the same feature specifier for training and testing; otherwise the model can't recognize the input correctly.

tinymd commented 2 years ago

Hi! I am still confused about this `pretrain` argument. The atom representation is fixed if I use the `default` chemical feature specifier, so what is the meaning of `pretrain`?

vladimirkovacevic commented 2 years ago

@KiddoZhu, sorry, your last response does not address my question. In the Pretrained Molecular Representations example, when GIN is instantiated it has random weights, right? It is passed to InfoGraph as such. Setting `node_feature="pretrain"` on the dataset object does not set weights for GIN. This does not seem like the desired behavior to me. Can you please confirm, or correct me if I'm wrong? Thanks!

KiddoZhu commented 2 years ago

`node_feature` has nothing to do with the weights of the network. It only defines the attribute `graph.node_feature` for every graph in that dataset, which is used as the input to the network.

For example, the `default` node feature is a concatenation of several chemical properties, such as the one-hot encoding of the atom type, the atomic mass, the formal charge, etc. For pretraining, the `pretrain` node feature exactly follows the original paper, but you may also try other features. Whichever node feature you use, you need to stick to the same feature during finetuning; otherwise the shape of the input won't match the network.
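The shape-mismatch failure mode can be sketched in plain Python (a toy stand-in, not TorchDrug): the first layer of the network is sized to the feature dimension seen at pretraining time, so features from a different specifier are rejected at finetuning time.

```python
def make_linear(input_dim, output_dim=4):
    """A minimal 'network' whose first layer is fixed to input_dim."""
    weights = [[0.1] * input_dim for _ in range(output_dim)]

    def forward(x):
        if len(x) != input_dim:
            raise ValueError(f"expected {input_dim} input features, got {len(x)}")
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return forward

# "Pretrain" with a specifier that yields 5-dimensional node features...
model = make_linear(input_dim=5)
out = model([1.0, 0.0, 0.0, 0.0, 1.0])  # works: a 4-dimensional output

# ...then "finetune" with a different specifier yielding 3 features:
try:
    model([1.0, 0.0, 0.0])
except ValueError as e:
    print("shape mismatch:", e)
```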