Shen-Lab / GraphCL

[NeurIPS 2020] "Graph Contrastive Learning with Augmentations" by Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, Yang Shen
MIT License

Questions about the transfer learning experiments #21

Closed · ha-lins closed this issue 3 years ago

ha-lins commented 3 years ago

Hi @yyou1996,

I tried pretraining in the transfer learning experiments but found that: 1) on the Bio benchmark, the loss and acc are always 0.0 and nan; 2) on the Chem dataset, the training process stops at the 2nd epoch. Could you please check the code? By the way, the finetuning command worked well. Thanks a lot!

[screenshot: Bio pretraining log showing loss/acc of 0.0 and nan]

[screenshot: Chem pretraining stopping at epoch 2]

P.S. My environment uses CUDA 10.0, and PyG is the same version as required, just built for cu100.

yyou1996 commented 3 years ago

Hi @ha-lins,

The correct command to run is python pretrain_graphcl.py --aug1 random --aug2 random (see https://github.com/Shen-Lab/GraphCL/tree/master/transferLearning_MoleculeNet_PPI#pre-training); otherwise the default aug is none.

As for the acc part: since pretraining is not necessarily related to supervised accuracy, I just set acc=0 and print it, so you can ignore it. https://github.com/Shen-Lab/GraphCL/blob/7eefcc3ca3e0c9a579fd17bcb06fd28df9733312/transferLearning_MoleculeNet_PPI/bio/pretrain_graphcl.py#L103

ha-lins commented 3 years ago

Hi @yyou1996,

I ran into the same problem after setting the aug parameters; the training process still stops at epoch 2. Could you please give some suggestions? Thanks!

[screenshot: pretraining log stopping at epoch 2 with augmentations set]

yyou1996 commented 3 years ago

Hi @ha-lins,

Is it also the case if you set both to random?

ha-lins commented 3 years ago

Yes, it's the same if I set both to random, @yyou1996:

[screenshot: pretraining log stopping at epoch 2 with both augmentations set to random]

It's puzzling, since the first epoch runs well while the second does not.

yyou1996 commented 3 years ago

@ha-lins

I suspect it might be a memory issue. Would you try adding del dataset1 and del dataset2 at the end of this train() function (before the return)? Also monitor the memory (e.g. with htop) to see whether that is the problem. https://github.com/Shen-Lab/GraphCL/blob/d857849d51bb168568267e07007c0b0c8bb6d869/transferLearning_MoleculeNet_PPI/chem/pretrain_graphcl.py#L75
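Concretely, something like this (untested sketch; dataset1 and dataset2 are assumed to be the names of the two augmented copies created inside train(), so adjust to your local copy):

```python
def train(args, model, device, dataset, optimizer):
    # ... existing body: create the two augmented views (dataset1, dataset2),
    #     wrap them in dataloaders, and run the contrastive loop ...

    # Suggested addition: release the per-epoch augmented copies before
    # returning, so their memory is freed before the next epoch.
    del dataset1
    del dataset2
    # ... original return statement unchanged ...
```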

ha-lins commented 3 years ago

@yyou1996 ,

Thanks for the suggestion. I tried it, but it didn't work. With htop, memory usage looks fine in epoch 1, but the main process and all sub-processes disappear suddenly in epoch 2. I ended the job manually and found it always stops at the same place, in shuffle and get():

[screenshot: traceback after the manual interrupt, stopped in shuffle/get()]

Are there any other ideas? Thanks!

yyou1996 commented 3 years ago

@ha-lins

It looks like the process on your machine is somehow being killed. I just checked their shuffle function, and it does apply copy. So my suspicion is: in epoch 1 the memory stays under the limit and everything works fine, but in epoch 2 the copy inside shuffle doubles(?) the memory, leading to the process being killed.

This might be solved if, at the end of each epoch, all memory held by the datasets and dataloaders is cleared. So I would try del dataset1, del dataset2, del dataloader1, del dataloader2, and according to this link probably gc.collect() is also needed afterwards.
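A rough sketch of that (untested; the variable names just follow the suggestion above, so adjust them to the actual names in your local copy, and gc needs to be imported once at the top of pretrain_graphcl.py):

```python
import gc  # add once at the top of pretrain_graphcl.py

def train(args, model, device, dataset, optimizer):
    # ... existing contrastive training loop ...

    # Drop everything that still references the shuffled/copied data,
    # then force a garbage-collection pass before the next epoch starts.
    del dataset1, dataset2
    del dataloader1, dataloader2
    gc.collect()
    # ... original return statement unchanged ...
```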

Lastly, FYI, the new version https://github.com/Shen-Lab/GraphCL_Automated/tree/master/transferLearning_MoleculeNet_PPI does not apply the copy function anymore, so it shouldn't have the above problem.

ha-lins commented 3 years ago

@yyou1996

gc.collect() is somewhat helpful, but I have to interrupt the job manually (Ctrl+C) before it can go on; otherwise it still hits the stopping problem. In other words, I have to interrupt it manually at each epoch. Is there any way to do this automatically?

[screenshot: training continuing only after a manual Ctrl+C at each epoch]

The new version in GraphCL_Automated looks good, but it would require too many changes on my side, so I will keep trying the first solution.

yyou1996 commented 3 years ago

@ha-lins

It looks like the dataloader gets stuck in multiprocessing. Could you try setting num_workers=1 for the dataloader? (e.g. https://github.com/Shen-Lab/GraphCL/blob/d857849d51bb168568267e07007c0b0c8bb6d869/transferLearning_MoleculeNet_PPI/chem/pretrain_graphcl.py#L83)
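For example (sketch only; loader1/loader2, dataset1/dataset2, and args are placeholders for whatever the script uses around that line, and the keyword arguments are the standard torch/PyG DataLoader ones):

```python
from torch_geometric.data import DataLoader  # older PyG import path, as used in this repo

# Use a single worker (or num_workers=0 to load entirely in the main
# process) to avoid the multi-process loading that seems to hang
# between epochs.
loader1 = DataLoader(dataset1, batch_size=args.batch_size,
                     shuffle=False, num_workers=1)
loader2 = DataLoader(dataset2, batch_size=args.batch_size,
                     shuffle=False, num_workers=1)
```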

yyou1996 commented 3 years ago

Hi @ha-lins,

Does this solve the problem? For a faster feedback cycle, you can email me for discussion (yuning.you@tamu.edu). I will post our solution in this issue afterwards.

ha-lins commented 3 years ago

@yyou1996,

The problem was not solved by setting num_workers to 1 or 0. I switched to another GPU machine, an RTX 3090 with the same environment, and it works now. My previous machine was an RTX Titan and its memory looked fine, so I actually did not figure out the root cause.

Anyway, thanks for your continued help. :)