[Closed] kdutia closed this issue 4 years ago
Can you check the following two things:
The connection to the notebook server is ok. I've just reduced max_step
to 40,000 and now it gets stuck at 22,000.
This is what I get when I run nvidia-smi
on my EC2 machine.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 42C P0 51W / 300W | 1271MiB / 16160MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 23371 C ...ector-tBN8TxCn/bin/python 1269MiB |
+-----------------------------------------------------------------------------+
Can you show me the entire cmdline?
This is my latest experiment, which is stuck at 22,000:
/home/ubuntu/.local/share/virtualenvs/heritage-connector-tBN8TxCn/lib/python3.7/site-packages/dgl/base.py:25: UserWarning: multigraph will be deprecated.DGL will treat all graphs as multigraph in the future.
warnings.warn(msg, warn_type)
Logs are being recorded at: /home/ubuntu/data/results/TransE_l1_hc1708_14/train.log
Reading train triples....
Finished. Read 1112834 train triples.
Reading valid triples....
Finished. Read 34417 valid triples.
Reading test triples....
Finished. Read 34417 test triples.
|Train|: 1112834
|valid|: 34417
|test|: 34417
Total initialize time 5.893 seconds
[proc 0][Train](1000/40000) average pos_loss: 0.9731047201752663
[proc 0][Train](1000/40000) average neg_loss: 0.8851320103555917
[proc 0][Train](1000/40000) average loss: 0.9291183650195599
[proc 0][Train](1000/40000) average regularization: 3.712290239400318e-05
[proc 0][Train] 1000 steps take 11.759 seconds
[proc 0]sample: 1.389, forward: 4.386, backward: 1.811, update: 4.162
[proc 0][Train](2000/40000) average pos_loss: 0.6279566985368729
[proc 0][Train](2000/40000) average neg_loss: 0.6747330973446369
[proc 0][Train](2000/40000) average loss: 0.6513448976278305
[proc 0][Train](2000/40000) average regularization: 4.70831001366605e-05
[proc 0][Train] 1000 steps take 8.641 seconds
[proc 0]sample: 1.267, forward: 4.299, backward: 1.802, update: 1.263
[proc 0][Train](3000/40000) average pos_loss: 0.39179811123013497
[proc 0][Train](3000/40000) average neg_loss: 0.48457677371799945
[proc 0][Train](3000/40000) average loss: 0.43818744249641894
[proc 0][Train](3000/40000) average regularization: 5.761738594082999e-05
[proc 0][Train] 1000 steps take 8.768 seconds
[proc 0]sample: 1.367, forward: 4.343, backward: 1.854, update: 1.194
[proc 0][Train](4000/40000) average pos_loss: 0.34823234072327613
[proc 0][Train](4000/40000) average neg_loss: 0.43337956032902003
[proc 0][Train](4000/40000) average loss: 0.39080595020949843
[proc 0][Train](4000/40000) average regularization: 6.410275710004499e-05
[proc 0][Train] 1000 steps take 8.561 seconds
[proc 0]sample: 1.273, forward: 4.279, backward: 1.805, update: 1.194
[proc 0][Train](5000/40000) average pos_loss: 0.28567829565703867
[proc 0][Train](5000/40000) average neg_loss: 0.381733529381454
[proc 0][Train](5000/40000) average loss: 0.33370591297745705
[proc 0][Train](5000/40000) average regularization: 7.077617696631933e-05
[proc 0][Train] 1000 steps take 8.606 seconds
[proc 0]sample: 1.320, forward: 4.235, backward: 1.804, update: 1.237
[proc 0][Train](6000/40000) average pos_loss: 0.27727535855770113
[proc 0][Train](6000/40000) average neg_loss: 0.359264762930572
[proc 0][Train](6000/40000) average loss: 0.31827006104588507
[proc 0][Train](6000/40000) average regularization: 7.580427336506546e-05
[proc 0][Train] 1000 steps take 8.738 seconds
[proc 0]sample: 1.313, forward: 4.382, backward: 1.802, update: 1.232
[proc 0][Train](7000/40000) average pos_loss: 0.25403493851423264
[proc 0][Train](7000/40000) average neg_loss: 0.340814183909446
[proc 0][Train](7000/40000) average loss: 0.297424561008811
[proc 0][Train](7000/40000) average regularization: 8.013840392231942e-05
[proc 0][Train] 1000 steps take 8.778 seconds
[proc 0]sample: 1.358, forward: 4.336, backward: 1.853, update: 1.220
[proc 0][Train](8000/40000) average pos_loss: 0.249679933860898
[proc 0][Train](8000/40000) average neg_loss: 0.32741397356987
[proc 0][Train](8000/40000) average loss: 0.28854695366322997
[proc 0][Train](8000/40000) average regularization: 8.428859609557549e-05
[proc 0][Train] 1000 steps take 8.737 seconds
[proc 0]sample: 1.307, forward: 4.384, backward: 1.814, update: 1.221
[proc 0][Train](9000/40000) average pos_loss: 0.2401422714293003
[proc 0][Train](9000/40000) average neg_loss: 0.32162084911391137
[proc 0][Train](9000/40000) average loss: 0.2808815600425005
[proc 0][Train](9000/40000) average regularization: 8.721390135906404e-05
[proc 0][Train] 1000 steps take 8.843 seconds
[proc 0]sample: 1.335, forward: 4.411, backward: 1.816, update: 1.269
[proc 0][Train](10000/40000) average pos_loss: 0.23346979524195194
[proc 0][Train](10000/40000) average neg_loss: 0.31008259259164334
[proc 0][Train](10000/40000) average loss: 0.2717761939018965
[proc 0][Train](10000/40000) average regularization: 9.087248668947723e-05
[proc 0][Train] 1000 steps take 8.752 seconds
[proc 0]sample: 1.290, forward: 4.396, backward: 1.831, update: 1.224
[proc 0][Train](11000/40000) average pos_loss: 0.23218116450309753
[proc 0][Train](11000/40000) average neg_loss: 0.3106829285062849
[proc 0][Train](11000/40000) average loss: 0.2714320463836193
[proc 0][Train](11000/40000) average regularization: 9.28232833830407e-05
[proc 0][Train] 1000 steps take 8.918 seconds
[proc 0]sample: 1.371, forward: 4.442, backward: 1.827, update: 1.266
[proc 0][Train](12000/40000) average pos_loss: 0.2225425555408001
[proc 0][Train](12000/40000) average neg_loss: 0.2998654997013509
[proc 0][Train](12000/40000) average loss: 0.2612040272951126
[proc 0][Train](12000/40000) average regularization: 9.625060645339545e-05
[proc 0][Train] 1000 steps take 8.826 seconds
[proc 0]sample: 1.322, forward: 4.444, backward: 1.856, update: 1.194
[proc 0][Train](13000/40000) average pos_loss: 0.22791376207768918
[proc 0][Train](13000/40000) average neg_loss: 0.3035700426399708
[proc 0][Train](13000/40000) average loss: 0.2657419020012021
[proc 0][Train](13000/40000) average regularization: 9.777809328079456e-05
[proc 0][Train] 1000 steps take 8.668 seconds
[proc 0]sample: 1.283, forward: 4.339, backward: 1.832, update: 1.205
[proc 0][Train](14000/40000) average pos_loss: 0.2136377188116312
[proc 0][Train](14000/40000) average neg_loss: 0.29430538304895165
[proc 0][Train](14000/40000) average loss: 0.25397155099362134
[proc 0][Train](14000/40000) average regularization: 0.00010068564637185773
[proc 0][Train] 1000 steps take 8.839 seconds
[proc 0]sample: 1.392, forward: 4.428, backward: 1.816, update: 1.192
[proc 0][Train](15000/40000) average pos_loss: 0.22163018888235092
[proc 0][Train](15000/40000) average neg_loss: 0.29803618866577747
[proc 0][Train](15000/40000) average loss: 0.2598331885784864
[proc 0][Train](15000/40000) average regularization: 0.00010206920657947194
[proc 0][Train] 1000 steps take 8.699 seconds
[proc 0]sample: 1.265, forward: 4.363, backward: 1.824, update: 1.237
[proc 0][Train](16000/40000) average pos_loss: 0.21076470874249936
[proc 0][Train](16000/40000) average neg_loss: 0.29174677131138743
[proc 0][Train](16000/40000) average loss: 0.25125573988258837
[proc 0][Train](16000/40000) average regularization: 0.0001043690236110706
[proc 0][Train] 1000 steps take 8.948 seconds
[proc 0]sample: 1.399, forward: 4.436, backward: 1.861, update: 1.241
[proc 0][Train](17000/40000) average pos_loss: 0.2166016393750906
[proc 0][Train](17000/40000) average neg_loss: 0.293723577266559
[proc 0][Train](17000/40000) average loss: 0.2551626083999872
[proc 0][Train](17000/40000) average regularization: 0.00010588835964153987
[proc 0][Train] 1000 steps take 8.634 seconds
[proc 0]sample: 1.272, forward: 4.379, backward: 1.822, update: 1.151
[proc 0][Train](18000/40000) average pos_loss: 0.20872869043052197
[proc 0][Train](18000/40000) average neg_loss: 0.2904951977562159
[proc 0][Train](18000/40000) average loss: 0.24961194440722465
[proc 0][Train](18000/40000) average regularization: 0.00010771663221385097
[proc 0][Train] 1000 steps take 8.670 seconds
[proc 0]sample: 1.306, forward: 4.291, backward: 1.819, update: 1.243
[proc 0][Train](19000/40000) average pos_loss: 0.21248555865883828
[proc 0][Train](19000/40000) average neg_loss: 0.29046530737914145
[proc 0][Train](19000/40000) average loss: 0.2514754327312112
[proc 0][Train](19000/40000) average regularization: 0.00010925246066472027
[proc 0][Train] 1000 steps take 8.893 seconds
[proc 0]sample: 1.336, forward: 4.475, backward: 1.819, update: 1.253
[proc 0][Train](20000/40000) average pos_loss: 0.20720807661116122
[proc 0][Train](20000/40000) average neg_loss: 0.28932265562191606
[proc 0][Train](20000/40000) average loss: 0.2482653663828969
[proc 0][Train](20000/40000) average regularization: 0.00011058572954789269
[proc 0][Train] 1000 steps take 8.636 seconds
[proc 0]sample: 1.322, forward: 4.292, backward: 1.838, update: 1.174
[proc 0][Train](21000/40000) average pos_loss: 0.2084350918084383
[proc 0][Train](21000/40000) average neg_loss: 0.2872589902477339
[proc 0][Train](21000/40000) average loss: 0.24784704087674617
[proc 0][Train](21000/40000) average regularization: 0.00011231124978075968
[proc 0][Train] 1000 steps take 8.813 seconds
[proc 0]sample: 1.273, forward: 4.403, backward: 1.821, update: 1.305
[proc 0][Train](22000/40000) average pos_loss: 0.20681360100209714
[proc 0][Train](22000/40000) average neg_loss: 0.2890164882643148
Is the launch cmd line dglke_train ...?
Which dgl version do you use?
The latest release on PyPI. Sorry! The command is:
dglke_train --max_step 40000 --model_name NAMES --data_path ~/data --save_path ~/data/results --dataset hc1708 \
--format raw_udd_htr --data_files train.txt valid.txt test.txt \
--log_interval 1000 --batch_size 1024 --batch_size_eval 16 --neg_sample_size 16 \
--lr LR --hidden_dim HIDDEN_DIM -rc REGULARIZATION_COEF -g GAMMA \
--gpu 0 --mix_cpu_gpu --async_update --test --neg_sample_size_eval 100000 | tee results.txt
Also, adding the following flag makes no difference to whether it works:
-adv
I see, you use tee. The output will be buffered, so you may need to wait for the test to finish.
Thanks, I hadn't considered this.
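If the buffering is the problem, one workaround is to force line buffering on the producer side so each log line reaches tee (and the notebook) as soon as it is printed. A minimal sketch, assuming GNU coreutils' stdbuf is available; printf stands in here for the real dglke_train invocation:

```shell
# stdbuf -oL makes the command's stdout line-buffered, so `tee` writes
# each training log line immediately instead of waiting for the pipe's
# block buffer to fill up.
# (printf is only a stand-in for the actual dglke_train command.)
stdbuf -oL printf 'step 1000 done\nstep 2000 done\n' | tee results.txt
```

The same `stdbuf -oL` prefix can be put in front of the dglke_train pipeline above.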
I'm trying to find an optimal set of parameters by running dglke_train with various sets of randomly sampled parameters, and on the first instance it keeps freezing at the same iteration. I'm running the command through the shell-escape (!) syntax in a Jupyter notebook so I can loop through the different sampled parameters.
Repeating the test
I'm using a custom dataset, but these are the parameters:
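The random-search loop described above could be sketched in plain shell; this is only an illustrative sketch (the candidate values and the `sample` helper are hypothetical, not taken from the issue), assuming GNU coreutils' shuf:

```shell
# Pick one value uniformly at random from the arguments (GNU coreutils shuf).
sample() { shuf -e "$@" -n 1; }

for trial in 1 2 3; do
  LR=$(sample 0.01 0.05 0.1)        # illustrative candidate values
  HIDDEN_DIM=$(sample 100 200 400)
  GAMMA=$(sample 5 10 20)
  echo "trial $trial: --lr $LR --hidden_dim $HIDDEN_DIM -g $GAMMA"
  # dglke_train would be launched here with the sampled values
done
```

Driving the loop from a single script (or a %%bash cell) avoids repeatedly spawning subshells via `!` and keeps each trial's output in one stream.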
More info
Here are the last few steps before it freezes (it hangs for 10 minutes before I cancel it):
Not sure why this is happening.