dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

How does DGL save gpu memory? #1690

Closed maqy1995 closed 4 years ago

maqy1995 commented 4 years ago

❓ Questions and Help

Hi, I am new to DGL and GNNs.

When I run the GraphSAGE example on the Reddit dataset (my GPU is a Tesla T4), I found that DGL can feed the entire training set for training or inference, while PyG runs out of memory once the batch size reaches about 9000. In my experiments, when the batch size grows from 1024 to 8192, DGL's memory usage (inference) goes from about 1 GB to 2 GB, while PyG's expands from 2 GB to 12 GB. So in DGL the batch size increased 8x with no corresponding increase in memory. What is the reason for this? What makes DGL use less memory than PyG?

P.S. Is this benefit brought by "kernel fusion"? I found this blog post: https://www.dgl.ai/blog/2019/05/04/kernel.html, but I did not fully understand it.

yzh119 commented 4 years ago

Yep, it's because of kernel fusion. You can think of it as DGL computing the result directly on the destination nodes without copying node features to edges (saving the GPU memory cost of #edges * D), which most scatter-gather based frameworks such as PyG require.
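To make the memory difference concrete, here is a small sketch (not DGL's actual kernels) contrasting the two strategies for summing neighbor features. The scatter-gather style materializes one message per edge, an intermediate tensor of size #edges × D; the fused style computes the same per-destination sums with a single sparse-dense matmul (SpMM) and never allocates that edge tensor:

```python
import torch

num_nodes, num_edges, D = 5, 8, 4
src = torch.tensor([0, 1, 2, 3, 4, 0, 1, 2])
dst = torch.tensor([1, 2, 3, 4, 0, 2, 3, 4])
feat = torch.randn(num_nodes, D)

# Scatter-gather style (what generic message passing does):
# copy source features onto edges, then reduce per destination.
messages = feat[src]                      # (#edges, D) intermediate lives in memory
out_scatter = torch.zeros(num_nodes, D)
out_scatter.index_add_(0, dst, messages)  # sum messages into destination nodes

# Fused style: one SpMM over the adjacency computes the same sums
# directly on destination nodes, without the (#edges, D) tensor.
adj = torch.sparse_coo_tensor(
    torch.stack([dst, src]), torch.ones(num_edges), (num_nodes, num_nodes)
)
out_fused = torch.sparse.mm(adj, feat)

assert torch.allclose(out_scatter, out_fused, atol=1e-5)
```

With Reddit-scale graphs (~100M edges) and hidden sizes in the hundreds, that per-edge intermediate alone can run to tens of gigabytes, which is consistent with the OOM behavior you observed.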

maqy1995 commented 4 years ago

> Yep, it's because of kernel fusion. You can think of it as DGL computing the result directly on the destination nodes without copying node features to edges (saving the GPU memory cost of #edges * D), which most scatter-gather based frameworks such as PyG require.

Thanks for your reply. Are there any more detailed documents or papers on kernel fusion or scatter-gather?

yzh119 commented 4 years ago

We will post an updated version of the DGL paper on arXiv in the coming days, describing DGL's new features and system design. Please stay tuned.

maqy1995 commented 4 years ago

Great! Thanks again for your reply. I will close this issue.