awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

training freeze, no error #166

Closed: walton-wang929 closed this issue 4 years ago

walton-wang929 commented 4 years ago

Hello guys, my training froze at the same epoch both times I tried. Below is the output right before the freeze:

```
[proc 1][Train] 100 steps take 20.119 seconds
[proc 1]sample: 1.558, forward: 16.974, backward: 0.704, update: 0.882
[proc 11]Train average pos_loss: 0.5498463898897171
[proc 11]Train average neg_loss: 1.007874653339386
[proc 11]Train average loss: 0.7788605213165283
[proc 11]Train average regularization: 0.0007757890276843682
[proc 11][Train] 100 steps take 19.812 seconds
[proc 11]sample: 1.326, forward: 16.447, backward: 0.685, update: 1.353
[proc 6]Train average pos_loss: 0.6275217425823212
[proc 6]Train average neg_loss: 1.0286491405963898
[proc 6]Train average loss: 0.8280854398012161
[proc 6]Train average regularization: 0.0007514445093693211
[proc 6][Train] 100 steps take 18.929 seconds
[proc 6]sample: 1.393, forward: 15.755, backward: 0.674, update: 1.106
[proc 18]Train average pos_loss: 0.593240208029747
[proc 18]Train average neg_loss: 1.013009267449379
[proc 18]Train average loss: 0.8031247377395629
[proc 18]Train average regularization: 0.0007214217411819846
[proc 18][Train] 100 steps take 18.691 seconds
[proc 18]sample: 1.266, forward: 15.496, backward: 0.878, update: 1.049
[proc 11]Train average pos_loss: 0.5426840874552726
[proc 11]Train average neg_loss: 0.957381341457367
[proc 11]Train average loss: 0.7500327146053314
[proc 11]Train average regularization: 0.0007891122740693391
[proc 11][Train] 100 steps take 12.773 seconds
[proc 11]sample: 0.926, forward: 10.123, backward: 0.730, update: 0.993
```

The command I executed is:

```bash
DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --data_path ./data/360KG_V2/ \
    --format udd_hrt --dataset 360KG \
    --data_files entities.dict relation.dict train.txt valid.txt test.txt \
    --save_path ./run/360KG --max_step 320000 --batch_size 1000 --batch_size_eval 16 \
    --neg_sample_size 200 --log_interval 100 --hidden_dim 400 --gamma 19.9 --lr 0.1 \
    --regularization_coef 1.00E-09 --test -adv --mix_cpu_gpu \
    --num_proc 40 --num_thread 20 --rel_part --force_sync_interval 1000 \
    --gpu 0 1 2 3 4 5 6 7 --regularization_coef 1e-9 \
    --neg_sample_size_eval 10000 --no_eval_filter
```

The CPU and GPU utilization looks like this: [two screenshots: CPU and GPU usage]

classicsong commented 4 years ago

Which PyTorch and DGL versions are you using?

walton-wang929 commented 4 years ago

```
>>> dgl.__version__
'0.4.3'
>>> torch.__version__
'1.7.0'
```

classicsong commented 4 years ago

One possible reason: you used `--force_sync_interval 1000` but did not add the `--async_update` flag. `--force_sync_interval` should work together with `--async_update`.
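Concretely, that would mean rerunning the same command with `--async_update` added. A sketch only, untested, based on the command quoted above (this version also drops the duplicated `--regularization_coef`):

```bash
DGLBACKEND=pytorch dglke_train --model_name TransE_l2 --data_path ./data/360KG_V2/ \
    --format udd_hrt --dataset 360KG \
    --data_files entities.dict relation.dict train.txt valid.txt test.txt \
    --save_path ./run/360KG --max_step 320000 --batch_size 1000 --batch_size_eval 16 \
    --neg_sample_size 200 --neg_sample_size_eval 10000 --no_eval_filter \
    --log_interval 100 --hidden_dim 400 --gamma 19.9 --lr 0.1 \
    --regularization_coef 1e-9 --test -adv --mix_cpu_gpu \
    --num_proc 40 --num_thread 20 --rel_part \
    --force_sync_interval 1000 --async_update --gpu 0 1 2 3 4 5 6 7
```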

walton-wang929 commented 4 years ago

I checked the training code; it looks like `--async_update` is for parameter updates in multi-GPU training? But I hadn't added `--async_update` before and it still worked. Anyway, I will add it and see what happens.

```python
self.add_argument('--force_sync_interval', type=int, default=-1,
                  help='We force a synchronization between processes every x steps for '
                       'multiprocessing training. This potentially stablizes the training process '
                       'to get a better performance. For multiprocessing training, it is set to 1000 by default.')

self.add_argument('--async_update', action='store_true',
                  help='Allow asynchronous update on node embedding for multi-GPU training. '
                       'This overlaps CPU and GPU computation to speed up.')
```
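For anyone else hitting this: the help strings above suggest a producer/consumer design, where the trainer hands gradients off to a separate updater instead of writing CPU-held embeddings synchronously. Here is a toy sketch of that general pattern, purely my own illustration with made-up names (`grad_queue`, `updater`), not DGL-KE's actual implementation:

```python
# Toy illustration of asynchronous embedding update (NOT DGL-KE's real code):
# a background thread applies gradient updates to CPU-held embeddings while
# the trainer keeps producing gradients, overlapping CPU and GPU work.
import queue
import threading

import torch

NUM_ENTITIES, DIM, LR = 1000, 16, 0.1

emb = torch.zeros(NUM_ENTITIES, DIM)   # entity embeddings kept on CPU
grad_queue = queue.Queue()             # trainer -> updater handoff

def updater():
    # Consume (indices, gradients) pairs until a None sentinel arrives.
    while True:
        item = grad_queue.get()
        if item is None:
            break
        idx, grad = item
        emb.index_add_(0, idx, -LR * grad)   # sparse SGD step on CPU

t = threading.Thread(target=updater, daemon=True)
t.start()

for step in range(100):
    idx = torch.randint(0, NUM_ENTITIES, (32,))
    grad = torch.randn(32, DIM)        # stand-in for real per-sample gradients
    grad_queue.put((idx, grad))        # hand off; don't wait for the write

grad_queue.put(None)                   # sentinel: flush remaining work and stop
t.join()
```

In this picture, a forced sync every N steps would amount to draining the queue before continuing, which presumably is why `--force_sync_interval` is only meaningful together with `--async_update`.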