awslabs / dgl-ke

High performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings.
https://dglke.dgl.ai/doc/
Apache License 2.0

Training on Multi-GPU #218

Closed DJRavinszkha closed 3 years ago

DJRavinszkha commented 3 years ago

I am fairly new to this and have been attempting to use pykeen to compute KG embeddings. However, it was taking a long time, and I realised that running on multiple GPUs would be faster, so I switched to DGL-KE. I have a graph of 10 million edges, on which I am trying to predict new links with RotatE and TransR. So far I have been using the following:

!DGLBACKEND=pytorch dglke_train --dataset CoV-KG --data_path ./train --data_files kg_train.tsv kg_valid.tsv kg_test.tsv \
--format 'raw_udd_hrt' --model_name RotatE --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 \
--gamma 12.0 --lr 0.1 --max_step 10000 --log_interval 100 --batch_size_eval 16 -adv \
--regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 --num_proc 4 \
--neg_sample_size_eval 10000 --async_update

yet the p3.8xlarge AWS EC2 instance shows that I am utilising the GPUs in the following manner:

Thu Jun 10 07:02:22 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   47C    P0    57W / 300W |  16056MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   47C    P0    56W / 300W |   1350MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   45C    P0    60W / 300W |   1350MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    58W / 300W |   1350MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6552      C   ...c2-user/pyenv/bin/python3    16053MiB |
|    1   N/A  N/A     37264      C   /usr/bin/python3                 1347MiB |
|    2   N/A  N/A     37298      C   /usr/bin/python3                 1347MiB |
|    3   N/A  N/A     37334      C   /usr/bin/python3                 1347MiB |
+-----------------------------------------------------------------------------+

Is this correct, or is there a way to improve how I am utilising the GPUs?

Finally, I am trying to replicate the work conducted in this study, in which they predicted new links in a KG of 15 million edges in under 40 minutes on a p3.16xlarge instance. Given that my KG is 10 million edges and I am using a p3.8xlarge instance, how much longer would these computations take for me? My data can be found at this Zenodo link, and my code at this GitHub link (simply copy and paste the code above to replace the pipeline chunks; you may also have to install dgl-ke):

!sudo pip3 install torch
!sudo pip3 install dgl==0.4.3
!sudo pip3 install dglke

Thanks for your help!

classicsong commented 3 years ago

Can you try adding --rel_part?
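
For reference, that means rerunning the command posted above with the extra flag appended; a sketch, with all other flags unchanged:

# (sketch: original command plus --rel_part, which enables relation partitioning)
!DGLBACKEND=pytorch dglke_train --dataset CoV-KG --data_path ./train --data_files kg_train.tsv kg_valid.tsv kg_test.tsv \
--format 'raw_udd_hrt' --model_name RotatE --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 \
--gamma 12.0 --lr 0.1 --max_step 10000 --log_interval 100 --batch_size_eval 16 -adv \
--regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 --num_proc 4 \
--neg_sample_size_eval 10000 --async_update --rel_part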

DJRavinszkha commented 3 years ago

After running it for approximately an hour (with log_interval set to 100), this is the output I have thus far.

Screenshot 2021-06-10 at 11 04 15 AM

And the following values indicate GPU temperature, power, utilisation, etc.:

Screenshot 2021-06-10 at 11 00 14 AM

Do these seem to make sense to you?

classicsong commented 3 years ago

How large is your graph?

DJRavinszkha commented 3 years ago

~10 million triples, ~73,000 entities, 42 relations.

classicsong commented 3 years ago

Can you try running it with a single GPU first? BTW, for RotatE, you need to add -de. See the sketch below.

For the multi-GPU command, you can refer to https://github.com/awslabs/dgl-ke/blob/master/examples/wikikg2/multi_gpu.sh.
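
For example, a single-GPU run with -de added could look like the following (a sketch based on the command posted earlier; --gpu, --num_proc, and --async_update are adjusted for one process, since --async_update only applies to multi-process training, and -de doubles the entity dimension as RotatE's complex-valued entity embeddings require):

# (sketch: single-GPU RotatE run with -de)
!DGLBACKEND=pytorch dglke_train --dataset CoV-KG --data_path ./train --data_files kg_train.tsv kg_valid.tsv kg_test.tsv \
--format 'raw_udd_hrt' --model_name RotatE -de --batch_size 1024 --neg_sample_size 256 --hidden_dim 400 \
--gamma 12.0 --lr 0.1 --max_step 10000 --log_interval 100 --batch_size_eval 16 -adv \
--regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 --num_proc 1 \
--neg_sample_size_eval 10000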

DJRavinszkha commented 3 years ago

You are such a life saver!!! I didn't realise I was missing the -de.

It works!!!!!!

Screenshot 2021-06-10 at 12 14 51 PM

I will keep this open for some time in case I run into any other issues, but for now I think the first issue has been solved.

Thank you so much for your quick and concise help!

DJRavinszkha commented 3 years ago

Hello Again,

I have been attempting to run the dglke_eval and dglke_predict functions on my pretrained model; however, I am encountering the following errors:

dglke_eval:

Screenshot 2021-06-14 at 10 46 23 PM

dglke_predict:

Screenshot 2021-06-14 at 10 46 35 PM

As you can see, I have PyTorch 1.6 installed, since this appeared to be the best version for dgl-ke.

In addition, I am using the entities.tsv and relations.tsv files initialised by dglke_train, along with my own lists of heads, relations, and tails. After running dglke_train successfully, the following output was saved:

Screenshot 2021-06-14 at 10 46 17 PM

and I expect the result of dglke_eval to match this value.
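
For reference, the commands I am running look roughly like this (a sketch; the checkpoint directory name follows dgl-ke's default ./ckpts/{model}_{dataset}_{run} pattern, the head/rel/tail list files are placeholders for my actual inputs, and -de, --hidden_dim, and --gamma mirror the training run):

# (sketch: evaluation with hyperparameters matching dglke_train)
!DGLBACKEND=pytorch dglke_eval --dataset CoV-KG --data_path ./train --data_files kg_train.tsv kg_valid.tsv kg_test.tsv \
--format 'raw_udd_hrt' --model_name RotatE -de --hidden_dim 400 --gamma 12.0 \
--batch_size_eval 16 --neg_sample_size_eval 10000 --gpu 0 1 2 3 --num_proc 4 \
--model_path ./ckpts/RotatE_CoV-KG_0/

# (sketch: link prediction over my own head/rel/tail ID lists; file names are placeholders)
!dglke_predict --model_path ./ckpts/RotatE_CoV-KG_0/ --format 'h_r_t' \
--data_files head.list rel.list tail.list --score_func logsigmoid --topK 10 \
--exec_mode 'all' --output result.tsv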

Thanks in advance for your help!