awslabs / dgl-lifesci

Python package for graph neural networks in chemistry and biology
Apache License 2.0

Question on the pre-trained model name and code #166

Closed kangqiyue closed 2 years ago

kangqiyue commented 2 years ago

Hi, I want to use the pre-trained models, and when I loaded them I was confused about how they were pre-trained.

There are four pre-trained models, as shown in the attached screenshot: [image of the model list]

The paper by Hu et al., 2019 includes four self-supervised methods for node-level pre-training (infomax, edge prediction, attribute masking, context prediction) and one supervised method for graph-level pre-training. In addition, I noticed the datasets are different: ~2M molecules (ZINC) for node-level pre-training and ~450k molecules (ChEMBL) for graph-level pre-training.

I am confused because the pre-trained model names in dgl-lifesci are "gin_supervised_contextpred", "gin_supervised_infomax", and so on. Were they pre-trained only with the self-supervised methods at the node level (using the ~2M dataset)? If so, I think a name like "self_supervised_contextpred" might be better.
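For reference, this is how I load them, a minimal sketch using dgllife's `load_pretrained` with the two names mentioned above:

```python
from dgllife.model import load_pretrained

# Downloads the checkpoint on first use and returns the pre-trained GIN backbone
contextpred_model = load_pretrained('gin_supervised_contextpred')
infomax_model = load_pretrained('gin_supervised_infomax')
contextpred_model.eval()
```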

In addition, could you provide the pre-training code (from scratch)? I did not find it in the source code of dgl-lifesci. Furthermore, it would be very kind if you could provide the time consumed by the pre-training process, which I want to explore further.

Thanks with best regards.

Update

I think I found the self-supervised code for "attr masking" here: https://github.com/awslabs/dgl-lifesci/tree/master/examples/property_prediction/pretrain_gnns/chem

mufeili commented 2 years ago

Were they pre-trained only with the self-supervised methods at the node level (using the ~2M dataset)? If so, I think a name like "self_supervised_contextpred" might be better.

They were pre-trained with both node-level self-supervised learning and graph-level supervised learning. See table 1 in Hu et al., 2019 and the DGL-LifeSci paper.

In addition, could you provide the pre-training code (from scratch)? I did not find it in the source code of dgl-lifesci. Furthermore, it would be very kind if you could provide the time consumed by the pre-training process, which I want to explore further.

As you figured out in the update, there is example code for self-supervised learning with attribute masking. The pre-trained models you mentioned are the ones released by Hu et al., 2019.

kangqiyue commented 2 years ago

@mufeili Thanks for your kind reply! Now I realize that the pre-trained models were trained with both node-level and graph-level tasks, so the "supervised" in their names is the correct name after all.

Actually, I want to fine-tune my own model starting from a model pre-trained only with the self-supervised (node-level) tasks, without the graph-level supervised stage. If so, is it true that I cannot use the pre-trained models provided in dgl-lifesci, and instead have to pre-train a model from scratch on a self-supervised node-level task?

In addition, when I looked into the code of pretrain_masking, I noticed that the pre-training dataset was prepared (SMILES converted to graphs) with the parameter "add_self_loop=True" (shown below). I am confused about why we need a self-loop here. (In my opinion, when running GIN, GCN, or other GNNs, a self-loop (an edge from a node to itself) is unnecessary. Am I wrong?)

https://github.com/awslabs/dgl-lifesci/blob/f8a176414b21b72c5ca1f8c7eb8d64702432ae24/examples/property_prediction/pretrain_gnns/chem/pretrain_masking.py#L188
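For clarity, here is a small sketch of what that flag does when converting a SMILES string with dgllife's `smiles_to_bigraph`; the molecule 'CCO' is just an example:

```python
from dgllife.utils import smiles_to_bigraph

# 'CCO' (ethanol) is just an illustrative molecule: 3 atoms, 2 bonds
g_plain = smiles_to_bigraph('CCO')                      # 2 bonds -> 4 directed edges
g_loops = smiles_to_bigraph('CCO', add_self_loop=True)  # plus one i -> i edge per atom

print(g_plain.num_edges())  # 4
print(g_loops.num_edges())  # 7
```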

Thanks again for your kind help!

mufeili commented 2 years ago

Actually, I want to fine-tune my own model starting from a model pre-trained only with the self-supervised (node-level) tasks, without the graph-level supervised stage. If so, is it true that I cannot use the pre-trained models provided in dgl-lifesci, and instead have to pre-train a model from scratch on a self-supervised node-level task?

You can use the pre-trained models released by the author here.
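A rough sketch of how such a checkpoint could be plugged into dgllife's GIN backbone for fine-tuning; the file name is a placeholder, the embedding sizes assume the Hu et al. atom/bond featurization, and the checkpoint's parameter names may not match dgllife's module, so treat it as an illustration rather than a tested recipe:

```python
import torch
from dgllife.model.gnn import GIN

# Embedding sizes below assume the atom/bond featurization used by Hu et al.
# (120 atom types, 3 chirality tags, 6 bond types, 3 bond directions).
gnn = GIN(num_node_emb_list=[120, 3], num_edge_emb_list=[6, 3])

# 'node_level_only_checkpoint.pth' is a placeholder path for a checkpoint
# released elsewhere; its parameter names may not line up with this module.
state_dict = torch.load('node_level_only_checkpoint.pth', map_location='cpu')
missing, unexpected = gnn.load_state_dict(state_dict, strict=False)
print(missing, unexpected)  # verify that the weights were actually loaded

# ...then attach a readout + prediction head and fine-tune on the downstream task.
```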

In addition, when I looked into the code of pretrain_masking, I noticed that the pre-training dataset was prepared (SMILES converted to graphs) with the parameter "add_self_loop=True" (shown below). I am confused about why we need a self-loop here. (In my opinion, when running GIN, GCN, or other GNNs, a self-loop (an edge from a node to itself) is unnecessary. Am I wrong?)

I cannot remember the exact reason. In my experience, self-loops generally help or at least do not hurt model performance.

kangqiyue commented 2 years ago

OK, that's very helpful for me! Thanks again for your kind assistance!