awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

Training stops without error #136

Closed kdutia closed 4 years ago

kdutia commented 4 years ago

I'm trying to find an optimal set of hyperparameters by running dglke_train repeatedly with randomly sampled parameter sets, and on the first run it keeps freezing at the same iteration.

I'm running the command through the shell escape (!) in a Jupyter notebook so that I can loop over the sampled parameter sets.

Reproducing the issue

I'm using a custom dataset; these are the parameters:

model = TransE_l1
LOG_INTERVAL = 1000
BATCH_SIZE = 1000
BATCH_SIZE_EVAL = 16
NEG_SAMPLE_SIZE = 200
NEG_SAMPLE_SIZE_EVAL = 100000
LR = 0.1
-adv = True
hidden_dim = 50
regularization_coef = 2e-08
gamma = 10
neg_deg_sample = False
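For what it's worth, the notebook loop driving these runs can be sketched with subprocess instead of `!` (a minimal sketch, not the actual notebook code: the search-space values mirror the parameters above, and `SEARCH_SPACE`, `sample_args`, and `run_trial` are illustrative names):

```python
import random
import subprocess

# Hypothetical search space; values mirror the parameters listed above.
SEARCH_SPACE = {
    "--lr": [0.1, 0.01],
    "--hidden_dim": [50, 100],
    "--regularization_coef": [2e-8, 2e-7],
    "--gamma": [10.0, 12.0],
}

def sample_args(space, rng=random):
    """Randomly sample one value per flag and flatten into argv form."""
    args = []
    for flag, choices in space.items():
        args += [flag, str(rng.choice(choices))]
    return args

def run_trial(extra_args):
    """Launch dglke_train as a child process; unlike `!cmd | tee`,
    the child's stdout streams straight to the notebook's output."""
    cmd = ["dglke_train", "--model_name", "TransE_l1"] + extra_args
    return subprocess.run(cmd, check=True)

# Example: inspect one sampled command without launching training.
print(sample_args(SEARCH_SPACE))
```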

More info

Here are the last few logged steps before it freezes (it hangs for 10 minutes before I cancel it):

[proc 0][Train] 1000 steps take 8.256 seconds
[proc 0]sample: 1.353, forward: 4.006, backward: 1.711, update: 1.175
[proc 0][Train](35000/60000) average pos_loss: 0.19853664480149746
[proc 0][Train](35000/60000) average neg_loss: 0.2785924620358273
[proc 0][Train](35000/60000) average loss: 0.23856455320119857
[proc 0][Train](35000/60000) average regularization: 0.00012192570248589618
[proc 0][Train] 1000 steps take 8.269 seconds
[proc 0]sample: 1.278, forward: 3.986, backward: 1.712, update: 1.283
[proc 0][Train](36000/60000) average pos_loss: 0.19503579252958297
[proc 0][Train](36000/60000) average neg_loss: 0.27933850078843536
[proc 0][Train](36000/60000) average loss: 0.23718714690953493
[proc 0][Train](36000/60000) average regularization: 0.00012245436408556996
[proc 0][Train] 1000 steps take 8.305 seconds
[proc 0]sample: 1.346, forward: 4.012, backward: 1.712, update: 1.224
[proc 0][Train](37000/60000) average pos_loss: 0.19615361012518406
[proc 0][Train](37000/60000) average neg_loss: 0.27748048058338465
[proc 0][Train](37000/60000) average loss: 0.2368170451670885
[proc 0][Train](37000/60000) average regularization: 0.00012362484454206423
[proc 0][Train] 1000 steps take 8.305 seconds
[proc 0]sample: 1.270, forward: 3.999, backward: 1.733, update: 1.293
[proc 0][Train](38000/60000) average pos_loss: 0.19601027159392834
[proc 0][Train](38000/60000) average neg_loss: 0.2794102805918083
[proc 0][Train](38000/60000) average loss: 0.23771027632802724
[proc 0][Train](38000/60000) average regularization: 0.00012375975443137578
[proc 0][Train] 1000 steps take 8.283 seconds
[proc 0]sample: 1.310, forward: 3.903, backward: 1.766, update: 1.294
[proc 0][Train](39000/60000) average pos_loss: 0.19360717238485814
[proc 0][Train](39000/60000) average neg_loss: 0.2766080161612481
[proc 0][Train](39000/60000) average loss: 0.23510759409517049
[proc 0][Train](39000/60000) average regularization: 0.0001251919507922139
[proc 0][Train] 1000 steps take 8.287 seconds
[proc 0]sample: 1.269, forward: 3.998, backward: 1.742, update: 1.268
[proc 0][Train](40000/60000) average pos_loss: 0.19862385678291322
[proc 0][Train](40000/60000) average neg_loss: 0.279490821111016
[proc 0][Train](40000/60000) average loss: 0.2390573388412595
[proc 0][Train](40000/60000) average regularization: 0.00012537073031126055
[proc 0][Train] 1000 steps take 8.190 seconds
[proc 0]sample: 1.236, forward: 3.902, backward: 1.749, update: 1.293
[proc 0][Train](41000/60000) average pos_loss: 0.19015826864540578
[proc 0][Train](41000/60000) average neg_loss: 0.27666417042165997
[proc 0][Train](41000/60000) average loss: 0.23341121918708085
[proc 0][Train](41000/60000) average regularization: 0.00012650544225471094
[proc 0][Train] 1000 steps take 8.237 seconds
[proc 0]sample: 1.311, forward: 3.908, backward: 1.717, update: 1.291
[proc 0][Train](42000/60000) average pos_loss: 0.19738745559751988
[proc 0][Train](42000/60000) average neg_loss: 0.279010270354338
[proc 0][Train](42000/60000) average loss: 0.23819886273890734
[proc 0][Train](42000/60000) average regularization: 0.0001268844535225071
[proc 0][Train] 1000 steps take 8.367 seconds
[proc 0]sample: 1.301, forward: 4.038, backward: 1.755, update: 1.263
[proc 0][Train](43000/60000) average pos_loss: 0.19044273269176484
[proc 0][Train](43000/60000) average neg_loss: 0.2760635534534231
[proc 0][Train](43000/60000) average loss: 0.23325314317643642

Not sure why this is happening.

classicsong commented 4 years ago

Can you check the following two things:

  1. The CPU and GPU usage when it 'freezes'
  2. The connection to the notebook server
kdutia commented 4 years ago

The connection to the notebook server is ok. I've just reduced max_step to 40,000 and now it gets stuck at 22,000.

This is what I get when I run nvidia-smi on my EC2 machine.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P0    51W / 300W |   1271MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     23371      C   ...ector-tBN8TxCn/bin/python     1269MiB |
+-----------------------------------------------------------------------------+
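As a side note, a single nvidia-smi snapshot can miss bursts of activity; polling continuously during the apparent freeze shows whether the GPU is truly idle. A sketch assuming a Linux box with the NVIDIA driver installed (the script name is illustrative):

```shell
# Write a small monitor script that polls GPU utilization once per second;
# run it alongside training and watch whether utilization stays at 0%
# during the "freeze". -l 1 repeats the query every second, CSV output
# makes the log easy to inspect afterwards.
cat > monitor_gpu.sh <<'EOF'
#!/bin/sh
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 1
EOF
chmod +x monitor_gpu.sh
```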
classicsong commented 4 years ago

Can you show me the entire cmdline?

kdutia commented 4 years ago

This is my latest experiment, which is stuck at 22,000:

/home/ubuntu/.local/share/virtualenvs/heritage-connector-tBN8TxCn/lib/python3.7/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
  warnings.warn(msg, warn_type)
Logs are being recorded at: /home/ubuntu/data/results/TransE_l1_hc1708_14/train.log
Reading train triples....
Finished. Read 1112834 train triples.
Reading valid triples....
Finished. Read 34417 valid triples.
Reading test triples....
Finished. Read 34417 test triples.
|Train|: 1112834
|valid|: 34417
|test|: 34417
Total initialize time 5.893 seconds
[proc 0][Train](1000/40000) average pos_loss: 0.9731047201752663
[proc 0][Train](1000/40000) average neg_loss: 0.8851320103555917
[proc 0][Train](1000/40000) average loss: 0.9291183650195599
[proc 0][Train](1000/40000) average regularization: 3.712290239400318e-05
[proc 0][Train] 1000 steps take 11.759 seconds
[proc 0]sample: 1.389, forward: 4.386, backward: 1.811, update: 4.162
[proc 0][Train](2000/40000) average pos_loss: 0.6279566985368729
[proc 0][Train](2000/40000) average neg_loss: 0.6747330973446369
[proc 0][Train](2000/40000) average loss: 0.6513448976278305
[proc 0][Train](2000/40000) average regularization: 4.70831001366605e-05
[proc 0][Train] 1000 steps take 8.641 seconds
[proc 0]sample: 1.267, forward: 4.299, backward: 1.802, update: 1.263
[proc 0][Train](3000/40000) average pos_loss: 0.39179811123013497
[proc 0][Train](3000/40000) average neg_loss: 0.48457677371799945
[proc 0][Train](3000/40000) average loss: 0.43818744249641894
[proc 0][Train](3000/40000) average regularization: 5.761738594082999e-05
[proc 0][Train] 1000 steps take 8.768 seconds
[proc 0]sample: 1.367, forward: 4.343, backward: 1.854, update: 1.194
[proc 0][Train](4000/40000) average pos_loss: 0.34823234072327613
[proc 0][Train](4000/40000) average neg_loss: 0.43337956032902003
[proc 0][Train](4000/40000) average loss: 0.39080595020949843
[proc 0][Train](4000/40000) average regularization: 6.410275710004499e-05
[proc 0][Train] 1000 steps take 8.561 seconds
[proc 0]sample: 1.273, forward: 4.279, backward: 1.805, update: 1.194
[proc 0][Train](5000/40000) average pos_loss: 0.28567829565703867
[proc 0][Train](5000/40000) average neg_loss: 0.381733529381454
[proc 0][Train](5000/40000) average loss: 0.33370591297745705
[proc 0][Train](5000/40000) average regularization: 7.077617696631933e-05
[proc 0][Train] 1000 steps take 8.606 seconds
[proc 0]sample: 1.320, forward: 4.235, backward: 1.804, update: 1.237
[proc 0][Train](6000/40000) average pos_loss: 0.27727535855770113
[proc 0][Train](6000/40000) average neg_loss: 0.359264762930572
[proc 0][Train](6000/40000) average loss: 0.31827006104588507
[proc 0][Train](6000/40000) average regularization: 7.580427336506546e-05
[proc 0][Train] 1000 steps take 8.738 seconds
[proc 0]sample: 1.313, forward: 4.382, backward: 1.802, update: 1.232
[proc 0][Train](7000/40000) average pos_loss: 0.25403493851423264
[proc 0][Train](7000/40000) average neg_loss: 0.340814183909446
[proc 0][Train](7000/40000) average loss: 0.297424561008811
[proc 0][Train](7000/40000) average regularization: 8.013840392231942e-05
[proc 0][Train] 1000 steps take 8.778 seconds
[proc 0]sample: 1.358, forward: 4.336, backward: 1.853, update: 1.220
[proc 0][Train](8000/40000) average pos_loss: 0.249679933860898
[proc 0][Train](8000/40000) average neg_loss: 0.32741397356987
[proc 0][Train](8000/40000) average loss: 0.28854695366322997
[proc 0][Train](8000/40000) average regularization: 8.428859609557549e-05
[proc 0][Train] 1000 steps take 8.737 seconds
[proc 0]sample: 1.307, forward: 4.384, backward: 1.814, update: 1.221
[proc 0][Train](9000/40000) average pos_loss: 0.2401422714293003
[proc 0][Train](9000/40000) average neg_loss: 0.32162084911391137
[proc 0][Train](9000/40000) average loss: 0.2808815600425005
[proc 0][Train](9000/40000) average regularization: 8.721390135906404e-05
[proc 0][Train] 1000 steps take 8.843 seconds
[proc 0]sample: 1.335, forward: 4.411, backward: 1.816, update: 1.269
[proc 0][Train](10000/40000) average pos_loss: 0.23346979524195194
[proc 0][Train](10000/40000) average neg_loss: 0.31008259259164334
[proc 0][Train](10000/40000) average loss: 0.2717761939018965
[proc 0][Train](10000/40000) average regularization: 9.087248668947723e-05
[proc 0][Train] 1000 steps take 8.752 seconds
[proc 0]sample: 1.290, forward: 4.396, backward: 1.831, update: 1.224
[proc 0][Train](11000/40000) average pos_loss: 0.23218116450309753
[proc 0][Train](11000/40000) average neg_loss: 0.3106829285062849
[proc 0][Train](11000/40000) average loss: 0.2714320463836193
[proc 0][Train](11000/40000) average regularization: 9.28232833830407e-05
[proc 0][Train] 1000 steps take 8.918 seconds
[proc 0]sample: 1.371, forward: 4.442, backward: 1.827, update: 1.266
[proc 0][Train](12000/40000) average pos_loss: 0.2225425555408001
[proc 0][Train](12000/40000) average neg_loss: 0.2998654997013509
[proc 0][Train](12000/40000) average loss: 0.2612040272951126
[proc 0][Train](12000/40000) average regularization: 9.625060645339545e-05
[proc 0][Train] 1000 steps take 8.826 seconds
[proc 0]sample: 1.322, forward: 4.444, backward: 1.856, update: 1.194
[proc 0][Train](13000/40000) average pos_loss: 0.22791376207768918
[proc 0][Train](13000/40000) average neg_loss: 0.3035700426399708
[proc 0][Train](13000/40000) average loss: 0.2657419020012021
[proc 0][Train](13000/40000) average regularization: 9.777809328079456e-05
[proc 0][Train] 1000 steps take 8.668 seconds
[proc 0]sample: 1.283, forward: 4.339, backward: 1.832, update: 1.205
[proc 0][Train](14000/40000) average pos_loss: 0.2136377188116312
[proc 0][Train](14000/40000) average neg_loss: 0.29430538304895165
[proc 0][Train](14000/40000) average loss: 0.25397155099362134
[proc 0][Train](14000/40000) average regularization: 0.00010068564637185773
[proc 0][Train] 1000 steps take 8.839 seconds
[proc 0]sample: 1.392, forward: 4.428, backward: 1.816, update: 1.192
[proc 0][Train](15000/40000) average pos_loss: 0.22163018888235092
[proc 0][Train](15000/40000) average neg_loss: 0.29803618866577747
[proc 0][Train](15000/40000) average loss: 0.2598331885784864
[proc 0][Train](15000/40000) average regularization: 0.00010206920657947194
[proc 0][Train] 1000 steps take 8.699 seconds
[proc 0]sample: 1.265, forward: 4.363, backward: 1.824, update: 1.237
[proc 0][Train](16000/40000) average pos_loss: 0.21076470874249936
[proc 0][Train](16000/40000) average neg_loss: 0.29174677131138743
[proc 0][Train](16000/40000) average loss: 0.25125573988258837
[proc 0][Train](16000/40000) average regularization: 0.0001043690236110706
[proc 0][Train] 1000 steps take 8.948 seconds
[proc 0]sample: 1.399, forward: 4.436, backward: 1.861, update: 1.241
[proc 0][Train](17000/40000) average pos_loss: 0.2166016393750906
[proc 0][Train](17000/40000) average neg_loss: 0.293723577266559
[proc 0][Train](17000/40000) average loss: 0.2551626083999872
[proc 0][Train](17000/40000) average regularization: 0.00010588835964153987
[proc 0][Train] 1000 steps take 8.634 seconds
[proc 0]sample: 1.272, forward: 4.379, backward: 1.822, update: 1.151
[proc 0][Train](18000/40000) average pos_loss: 0.20872869043052197
[proc 0][Train](18000/40000) average neg_loss: 0.2904951977562159
[proc 0][Train](18000/40000) average loss: 0.24961194440722465
[proc 0][Train](18000/40000) average regularization: 0.00010771663221385097
[proc 0][Train] 1000 steps take 8.670 seconds
[proc 0]sample: 1.306, forward: 4.291, backward: 1.819, update: 1.243
[proc 0][Train](19000/40000) average pos_loss: 0.21248555865883828
[proc 0][Train](19000/40000) average neg_loss: 0.29046530737914145
[proc 0][Train](19000/40000) average loss: 0.2514754327312112
[proc 0][Train](19000/40000) average regularization: 0.00010925246066472027
[proc 0][Train] 1000 steps take 8.893 seconds
[proc 0]sample: 1.336, forward: 4.475, backward: 1.819, update: 1.253
[proc 0][Train](20000/40000) average pos_loss: 0.20720807661116122
[proc 0][Train](20000/40000) average neg_loss: 0.28932265562191606
[proc 0][Train](20000/40000) average loss: 0.2482653663828969
[proc 0][Train](20000/40000) average regularization: 0.00011058572954789269
[proc 0][Train] 1000 steps take 8.636 seconds
[proc 0]sample: 1.322, forward: 4.292, backward: 1.838, update: 1.174
[proc 0][Train](21000/40000) average pos_loss: 0.2084350918084383
[proc 0][Train](21000/40000) average neg_loss: 0.2872589902477339
[proc 0][Train](21000/40000) average loss: 0.24784704087674617
[proc 0][Train](21000/40000) average regularization: 0.00011231124978075968
[proc 0][Train] 1000 steps take 8.813 seconds
[proc 0]sample: 1.273, forward: 4.403, backward: 1.821, update: 1.305
[proc 0][Train](22000/40000) average pos_loss: 0.20681360100209714
[proc 0][Train](22000/40000) average neg_loss: 0.2890164882643148
classicsong commented 4 years ago

I mean the launch command line, i.e. dglke_train .... Which DGL version do you use?

kdutia commented 4 years ago

Sorry! It's the latest release on PyPI. The command is:

dglke_train --max_step 40000 --model_name NAMES --data_path ~/data --save_path ~/data/results  --dataset hc1708 \
    --format raw_udd_htr --data_files train.txt valid.txt test.txt \
    --log_interval 1000 --batch_size 1024 --batch_size_eval 16 --neg_sample_size 16 \
    --lr LR --hidden_dim HIDDEN_DIM -rc REGULARIZATION_COEF -g GAMMA \
    --gpu 0 --mix_cpu_gpu --async_update --test --neg_sample_size_eval 100000 | tee results.txt

Also, the following makes no difference to whether it works:

classicsong commented 4 years ago

I see, you use tee. The output through the pipe is buffered, so it arrives in batches; you may need to wait for the test to finish.
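The buffering effect described above is easy to reproduce: when Python's stdout is a pipe rather than a terminal, it is block-buffered, so log lines show up in bursts. A minimal demonstration, assuming a POSIX shell with python3 (`-u`, or equivalently `PYTHONUNBUFFERED=1`, forces unbuffered output; the same idea applies to `dglke_train ... | tee`):

```shell
# With -u each print reaches tee as soon as it is written, instead of
# waiting until the process's stdout buffer fills or the process exits.
python3 -u -c 'print("step 1000 done")' | tee run.log
```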

kdutia commented 4 years ago

Thanks, I hadn't considered this.