training time - program seems to hang

LARS-research / AutoSF

Y. Zhang, Q. Yao, J. Kwok. Bilinear Scoring Function Search for Knowledge Graph Learning. TPAMI 2022


training time - program seems to hang #1

Closed by jpainam 4 years ago

jpainam commented 4 years ago

Hi, I'm trying to reproduce your work using 4 GeForce RTX GPUs with 10 GB of memory each. But after a few minutes, the training seems to hang.

 python train.py --task_dir KG_Data/WN18 --optim adagrad --lamb 0.000282 --lr 0.37775 --n_dim 64 --n_epoch 250 --n_batch 1024 --epoch_per_test 50 --test_batch_size 50 --thres 0.0 --parrel 5 --decay_rate 0.99456
B=4 Iter 1      sampled 5 candidate state for evaluate 12929
new: 0 [3 2 0 1] 4
new: 1 [0 2 1 3] 4
new: 2 [2 1 3 0] 4
new: 3 [2 3 0 1] 4
new: 4 [0 1 2 3] 4

nvidia-smi shows only 1821 MB of memory in use, so I guess it's not the GPU. How long do you think the training takes?

Thanks

yzhangee commented 4 years ago

Hi, sorry for only just noticing this question.

Based on Table VII in our paper, training a single model on WN18 takes about 20 minutes.

We only print the evaluation results after a model has been fully evaluated. Since several models are trained in parallel, the output would get messy if we printed too many logs. You can add some print statements of your own if you need them.
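For anyone who wants progress output without flooding the console, here is a minimal sketch of a sparse per-model logger. It is not part of AutoSF: `make_progress_logger`, `model_id`, and `every` are hypothetical names, and where to call it inside train.py is left to the reader.

```python
import logging
import time


def make_progress_logger(model_id, every=10):
    """Return a callback that logs one line every `every` epochs for one model.

    Hypothetical helper for illustration; it is not part of the AutoSF code.
    """
    logger = logging.getLogger(f"model-{model_id}")
    start = time.monotonic()

    def report(epoch, n_epoch, loss):
        # Log sparsely so that several parallel workers stay readable.
        if epoch % every == 0 or epoch == n_epoch:
            elapsed = time.monotonic() - start
            logger.info("epoch %d/%d loss=%.4f elapsed=%.0fs",
                        epoch, n_epoch, loss, elapsed)
            return True
        return False

    return report


# Usage: the logger name tags every line with the model it belongs to.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s [%(name)s] %(message)s")
report = make_progress_logger(model_id=0, every=50)
for epoch in range(1, 251):
    report(epoch, 250, loss=1.0 / epoch)  # placeholder loss value
```

With `every=50` and 250 epochs this prints five lines per model, which should be enough to tell a slow run from a real hang.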

jpainam commented 4 years ago

Hi. I followed the README and ran bash run.sh, but the training took 5 days on 4 GeForce RTX GPUs (10 GB each). Am I doing anything wrong?

Thank you.

yzhangee commented 4 years ago

That makes sense. Our code runs on 8 GPUs in parallel. Besides, you do not need to search all the way from B=4 to B=16. Generally, it's possible to get a good structure with B<=10.

jpainam commented 4 years ago

OK, so you mean that here https://github.com/AutoML-4Paradigm/AutoSF/blob/45475f97a89dcd9bba69a695bf136d42612bfcd3/train.py#L92 I should stop at B <= 10? I didn't make any modifications to your code, so I guess it's running in parallel.

yzhangee commented 4 years ago

Yes, you can add your own stopping criterion, according to B, time, or some other criteria. The basic principle is that we do not need to do an exhaustive search to obtain a good structure.
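A stopping criterion based on B or elapsed time could look roughly like the sketch below. It does not reproduce the loop in train.py: `should_stop`, `run_search`, and `evaluate` are hypothetical names, and the candidate sizes 4, 6, ..., 16 are an assumption about how B grows.

```python
import time


def should_stop(B, started_at, max_B=10, time_budget_s=None, now=None):
    """Return True if the search should end before evaluating size B."""
    if B > max_B:
        return True
    if time_budget_s is not None:
        now = time.monotonic() if now is None else now
        if now - started_at > time_budget_s:
            return True
    return False


def run_search(candidate_Bs, evaluate, max_B=10, time_budget_s=None):
    """Evaluate each size B in order until a stopping criterion fires."""
    started_at = time.monotonic()
    results = {}
    for B in candidate_Bs:
        if should_stop(B, started_at, max_B, time_budget_s):
            break
        results[B] = evaluate(B)  # stand-in for the per-B search step
    return results


# Usage: cap the search at B <= 10 (assumed step of 2 between sizes).
results = run_search(range(4, 17, 2),
                     evaluate=lambda B: f"searched B={B}",
                     max_B=10)
print(sorted(results))  # [4, 6, 8, 10]
```

Passing `time_budget_s` as well ends the search at whichever limit is reached first. The budget is only checked between sizes, so a step that is already running is allowed to finish.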

jpainam commented 4 years ago

Thanks, I'll try with B <= 10 and let you know. Could you please mention the other criteria that can be modified, either in the README or here? That would help. Thanks.