Closed: ha-lins closed this issue 3 years ago
Hi @ha-lins,
The correct command to run is `python pretrain_graphcl.py --aug1 random --aug2 random`
(see https://github.com/Shen-Lab/GraphCL/tree/master/transferLearning_MoleculeNet_PPI#pre-training), otherwise the default aug is `none`.
And for the acc part: since pretraining is not necessarily related to supervised accuracy, I just set acc=0 and print it. So just ignore it. https://github.com/Shen-Lab/GraphCL/blob/7eefcc3ca3e0c9a579fd17bcb06fd28df9733312/transferLearning_MoleculeNet_PPI/bio/pretrain_graphcl.py#L103
Hi @yyou1996,
I met the same problem when I set the `aug` parameter, and the training process stopped at epoch 2 as well. Could you please give some suggestions? Thanks!
Hi @ha-lins,
Is it also the case if you set both as `random`?
Yes, it's the same case if I set both as `random`, @yyou1996. It's puzzling, since the first epoch runs well while the second does not.
@ha-lins
I suspect it might be a memory issue. Would you try adding `del dataset1` and `del dataset2` at the end of the `train()` function (before `return`)? Also monitor the memory (e.g. with `htop`) to see whether that is the problem.
https://github.com/Shen-Lab/GraphCL/blob/d857849d51bb168568267e07007c0b0c8bb6d869/transferLearning_MoleculeNet_PPI/chem/pretrain_graphcl.py#L75
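A minimal, stdlib-only sketch of that suggestion (the function and variable names mirror the ones above but are illustrative, not the actual `pretrain_graphcl.py` code):

```python
def train(make_dataset):
    # Hypothetical stand-in for GraphCL's per-epoch train():
    # two augmented copies of the dataset are built each epoch.
    dataset1 = make_dataset()
    dataset2 = make_dataset()
    loss = float(len(dataset1) + len(dataset2))  # placeholder for the real loop
    # Suggested fix: drop both references before returning, so this
    # epoch's copies can be freed before epoch 2 allocates its own.
    del dataset1
    del dataset2
    return loss

print(train(lambda: [0.0] * 1_000_000))  # 2000000.0
```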
@yyou1996,
Thanks for your suggestion. I tried it, but it didn't work. With `htop`, the memory looks affordable in epoch 1, while the main process and all sub-processes suddenly disappeared in epoch 2. I ended the job manually and found it always stopped at the same place, in `shuffle` and `get()`:
Are there any other ideas? Thanks!
@ha-lins
It looks like the process on your machine is somehow being killed. I just checked their `shuffle` function, and they do have `copy` applied. So my suspicion is: in epoch 1 the memory is under the limit and everything works OK; in epoch 2, within `shuffle` there is a `copy` that doubles(?) the memory, leading to the process being killed.
This might be solved if, at the end of epoch 1, all memory for the datasets and dataloaders is cleared. So I would try `del dataset1, dataset2, dataloader1, dataloader2`. And according to this link, probably `gc.collect()` is also needed afterwards.
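A small illustration of why `gc.collect()` can matter here. The `Dataset`/`Loader` classes below are hypothetical stand-ins, assuming the dataloader keeps a back-reference to its dataset; such a reference cycle is not freed by `del` alone:

```python
import gc

class Dataset:
    """Hypothetical dataset holding a large buffer."""
    def __init__(self, n):
        self.data = [0.0] * n
        self.loader = None

class Loader:
    """Hypothetical dataloader keeping a back-reference to its dataset,
    which creates a reference cycle (dataset <-> loader)."""
    def __init__(self, dataset):
        self.dataset = dataset
        dataset.loader = self

dataset1, dataset2 = Dataset(10), Dataset(10)
dataloader1, dataloader2 = Loader(dataset1), Loader(dataset2)

# End-of-epoch cleanup as suggested above: drop the names, then force
# a collection so the cycles are actually reclaimed before epoch 2.
del dataset1, dataset2, dataloader1, dataloader2
unreachable = gc.collect()
print(unreachable > 0)  # True: the cyclic dataset/loader pairs are collected
```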
Lastly, FYI: the new version https://github.com/Shen-Lab/GraphCL_Automated/tree/master/transferLearning_MoleculeNet_PPI does not apply the `copy` function anymore, so it shouldn't have the above problem.
@yyou1996
`gc.collect()` is somewhat helpful. But I have to kill the job manually (`ctrl+c`); then it can go on. If I don't, it still faces the stopping problem. In other words, I have to kill it manually for each epoch. Is there an automatic way to quit?
The new version in GraphCL_Automated is good, but it needs too much revision. So I will try the first solution further.
@ha-lins
It looks like the dataloader is stuck in multi-process loading. Try setting `num_workers=1` for the dataloader? (e.g. https://github.com/Shen-Lab/GraphCL/blob/d857849d51bb168568267e07007c0b0c8bb6d869/transferLearning_MoleculeNet_PPI/chem/pretrain_graphcl.py#L83)
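Not the actual GraphCL code, but the kind of change being suggested, sketched with a helper (named here purely for illustration) that centralizes the DataLoader keyword arguments:

```python
def loader_kwargs(batch_size, num_workers):
    # num_workers=0 loads batches in the main process (no worker
    # subprocesses at all), which sidesteps multi-process hangs;
    # num_workers=1 keeps a single worker subprocess. These keys
    # match the torch DataLoader signature.
    return {"batch_size": batch_size, "num_workers": num_workers, "shuffle": True}

# e.g. loader1 = DataLoader(dataset1, **loader_kwargs(256, 0))
print(loader_kwargs(256, 0)["num_workers"])  # 0
```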
Hi @ha-lins,
Does it solve the problem? For a faster feedback cycle, maybe you can email me for discussion (yuning.you@tamu.edu). I will post our solution in this issue afterwards.
@yyou1996,
The problem was not solved by setting `num_workers=1` or `0`. I changed to another GPU machine, an RTX 3090 with this environment, and it works now. My previous machine was an RTX Titan, and its memory looked OK, so I actually did not figure out the reason.
Anyway, thanks for your continued help. : )
Hi @yyou1996,
I tried pretraining for the transfer-learning experiments but found: 1) on the Bio benchmark, the loss and acc are always 0.0 and nan; 2) on the chem dataset, the training process stopped at the 2nd epoch. Could you please check the code? Btw, the finetuning command worked well. Thanks a lot!
PS: My environment is `cuda=10.0`, and PyG has the same version, except that it is for `cu100`.