dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

Caching node features in GPU #3461

Open molamooo opened 2 years ago

molamooo commented 2 years ago

🚀 Feature

Hi, this is a researcher from SJTU. We hope to contribute to DGL the ability to cache node features in GPU memory when it's impossible to hold all of them.

Motivation

When using the GPU for sampling-based mini-batch training, it's likely that a GPU cannot hold all node features. In this case, DGL chooses to store all features in CPU memory. Our recent work finds that, with part of the node features cached in GPU memory and the cache policy properly chosen, the cost of copying features from CPU to GPU can be significantly reduced; e.g., caching 20% of the features yields an 80% cache hit rate.

Alternatives

The simplest method is to cache the features of the nodes with the smallest node IDs. An efficient yet still simple method is to cache the features of the nodes with the highest degrees, as the recent work PaGraph proposed. We go a step further by running an offline sampling epoch and then caching the most frequently visited nodes.

The degree-based method is easier to contribute, since it only requires knowledge of the graph structure. Though the sample-based policy achieves the best performance, it's not easy to design an elegant interface for it, since it requires knowledge of the sampling algorithm.
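
For concreteness, here is a minimal sketch of the degree-based policy and the cache lookup (the helper names and the plain-PyTorch cache are illustrative, not an existing DGL API):

```python
import torch

def build_degree_based_cache(graph, feats, cache_ratio=0.2, device="cuda"):
    # Rank nodes by in-degree and keep the top `cache_ratio` fraction on the GPU.
    num_cached = int(graph.num_nodes() * cache_ratio)
    cached_nids = torch.argsort(graph.in_degrees(), descending=True)[:num_cached]
    gpu_cache = feats[cached_nids].to(device)
    # Map: global node id -> row in the GPU cache (-1 means "not cached").
    cache_pos = torch.full((graph.num_nodes(),), -1, dtype=torch.long, device=device)
    cache_pos[cached_nids.to(device)] = torch.arange(num_cached, device=device)
    return gpu_cache, cache_pos

def fetch(feats, gpu_cache, cache_pos, nids):
    # `nids` is a CUDA tensor of node ids requested by the sampler.
    pos = cache_pos[nids]
    hit = pos >= 0
    out = torch.empty((nids.shape[0], feats.shape[1]),
                      dtype=gpu_cache.dtype, device=gpu_cache.device)
    out[hit] = gpu_cache[pos[hit]]                            # cache hit: stay on GPU
    out[~hit] = feats[nids[~hit].cpu()].to(gpu_cache.device)  # cache miss: copy from CPU
    return out
```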

Pitch

We'd like to know what the community thinks about the necessity of caching. I've looked through DGL's PR history and found a previous effort to extend DGL with a cache that failed to reach agreement on the design of the semantics (#316). We still hope to take on the job and first contribute degree-based caching, as it requires less design work but boosts performance significantly. As for the sample-based method, we'd like the community's help in identifying the proper way to design the new modules and port our prototype to DGL, if the proposal is accepted.

jermainewang commented 2 years ago

Hi @Jeffery-Song , thanks for the proposal. We will discuss it internally first and get back to you asap. cc @VoVAllen

molamooo commented 2 years ago

Looking forward to your reply.

yaox12 commented 2 years ago

Recently, a distributed GNN framework built upon PyG called Quiver achieved an impressive speedup in feature coalescing by leveraging a hierarchical caching strategy. So I believe this proposal is very helpful, especially for large-scale graphs, which are a common and urgent scenario in industry.

Details of the caching strategy in Quiver can be found here.

jermainewang commented 2 years ago

Hi @Jeffery-Song , it's been a long wait. We have just published an RFC (#3600) that should be quite relevant to what you've proposed here. The idea of the RFC is to define the storage interface for graph structure and features that DGL will comply with. You may want to have a look at the feature storage interface, where the caching mechanism can be built under the hood. By doing so, we hope to make the caching mechanism compatible with all kinds of sampling algorithms (not only neighbor sampling). Feel free to leave your feedback on the RFC.

Here are some of my further questions regarding the caching policy.

molamooo commented 2 years ago

Glad to see the proposal is making progress! I must say that my experience with GNNs is limited, so feel free to challenge these insights.

jermainewang commented 2 years ago

Thanks for the points. They are really helpful for me to understand the scope of the proposal. See if my summary is accurate.

cc @davidmin7 @nv-dlasalle @BarclayII

molamooo commented 2 years ago

👏🏻 Neat summary! I assume these should cover most use cases.

nv-dlasalle commented 2 years ago

I think this is a good fit for the customized FeatureStorage in #3600.

davidmin7 commented 2 years ago

I've read both the new FeatureStorage proposal and the caching idea, and they look very exciting. If DGL can systematically support the new data loading pipeline combined with UVA and caching, that could surely benefit the majority of the framework's users.

However, I just want to add some clarification here. It is difficult to use UVA with page migration (as opposed to the zero-copy access method) in a multiprocessing environment, because memory allocated with cudaMallocManaged() cannot be shared between processes. That's why I proposed the combination of shared memory + cudaHostRegister() in my previous works. Memory pinned with cudaHostRegister() can only be used for zero-copy access, not for page migration.
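
For reference, a rough Python sketch of that combination (assumptions: a Linux system where the CUDA runtime is loadable as `libcudart.so` via ctypes, and flag 0 meaning `cudaHostRegisterDefault`; this is not a DGL API):

```python
import ctypes
import torch

_cudart = ctypes.CDLL("libcudart.so")  # assumed to be on the loader path

def register_shared_features(feats: torch.Tensor) -> torch.Tensor:
    """Move `feats` into shared memory, then pin the mapping with cudaHostRegister()
    so the GPU can read it via zero-copy access. Each worker process has to call
    this on its own mapping, since pinning is per-process."""
    feats = feats.share_memory_()            # shareable across processes
    nbytes = feats.numel() * feats.element_size()
    err = _cudart.cudaHostRegister(
        ctypes.c_void_p(feats.data_ptr()),
        ctypes.c_size_t(nbytes),
        ctypes.c_uint(0),                    # cudaHostRegisterDefault
    )
    if err != 0:
        raise RuntimeError(f"cudaHostRegister failed with CUDA error {err}")
    return feats
```

Memory pinned this way only supports zero-copy reads from the device, as described above; it never migrates pages the way cudaMallocManaged() memory does.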

jermainewang commented 2 years ago

@Jeffery-Song I think for the next step, we should reach an agreement on the actual implementation plan. Since we are in the same timezone, how about having a short discussion between the two of us?

molamooo commented 2 years ago

> @Jeffery-Song I think for the next step, we should reach an agreement on the actual implementation plan. Since we are in the same timezone, how about having a short discussion between the two of us?

Sure! Where shall we discuss? WeChat, an online meeting, or email?

jermainewang commented 2 years ago

Had an offline discussion with @Jeffery-Song. The conclusion is that Jeffery will try to develop a prototype of CachedFeatureStorage, a wrapper class over any other FeatureStorage object that provides the caching mechanism internally. The estimated time for the prototype is by the end of the month.
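
Roughly, the wrapper we discussed could look like the sketch below (the fetch() signature and the num_rows() helper are placeholders, not the final interface from #3600):

```python
import torch

class CachedFeatureStorage:
    """Sketch of a wrapper that serves hot rows from a GPU cache and delegates
    misses to any underlying FeatureStorage-like object exposing fetch()."""

    def __init__(self, base_storage, cached_nids, device="cuda"):
        self._base = base_storage
        self._device = torch.device(device)
        # Warm the cache once with the chosen (e.g. top-degree) node ids.
        self._cache = base_storage.fetch(cached_nids, self._device)
        num_rows = base_storage.num_rows()   # placeholder helper on the base storage
        self._pos = torch.full((num_rows,), -1, dtype=torch.long, device=self._device)
        self._pos[cached_nids.to(self._device)] = torch.arange(
            len(cached_nids), device=self._device)

    def fetch(self, ids, device):
        pos = self._pos[ids]
        hit = pos >= 0
        out = torch.empty((ids.shape[0], self._cache.shape[1]),
                          dtype=self._cache.dtype, device=self._device)
        out[hit] = self._cache[pos[hit]]
        out[~hit] = self._base.fetch(ids[~hit], self._device)  # delegate misses
        return out.to(device)
```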

molamooo commented 2 years ago

I've opened a PR #3634 which currently only contains docstrings and a basic description of the procedure. Now I have several concerns.

jermainewang commented 2 years ago

> I've opened a PR #3634 which currently only contains docstrings and a basic description of the procedure. Now I have several concerns.

> • Not sure where to put the file. I only found a draft PR related to the new sampling pipeline. Maybe I should just continue development and move it to the correct location once everything else is stable?

I would suggest so too.

> • Currently I'll make async_fetch simply call synchronous fetching, for the same reason as the first point.

You could ignore async_fetch at the moment as we find it may need more clarification from actual use cases.

> • The current design only targets a static, degree-based cache. Previously we discussed also supporting, in the future, caching hot features according to the footprint of previous fetches. Is it better to support both methods in the same storage class, or separately? Or is it up to me to give an initial design?

I'm good with either, as long as you can demonstrate the effectiveness of your cache strategy on some important benchmarks.
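
For reference, the footprint-based variant could be prototyped roughly like this (illustrative only; it assumes a standard DGL dataloader that yields `(input_nodes, output_nodes, blocks)` mini-batches):

```python
import torch

def profile_hot_nodes(dataloader, num_nodes, num_cached):
    # Run one offline profiling epoch, count how often each node's features
    # are requested, and cache the most frequently visited nodes.
    freq = torch.zeros(num_nodes, dtype=torch.long)
    for input_nodes, output_nodes, blocks in dataloader:
        freq += torch.bincount(input_nodes.cpu(), minlength=num_nodes)
    return torch.argsort(freq, descending=True)[:num_cached]
```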

> • Deciding the cache ratio is not trivial, since it is affected by the model, the source feature dimension, and users' different GPU devices. The easiest way is to set the cache space to a fixed rate (e.g. 50%) of device memory, then calculate the cache ratio back from the feature dimension. Another way is to start at a low cache rate and provide an API to enlarge it after several epochs. Any advice?

How about letting users set an (initial) cache rate as well as a strategy argument? The options are "static" or whatever strategies you feel would be most effective in the first version.
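
As a starting point, backing the cache size out of a fixed memory budget could look roughly like this (illustrative; `torch.cuda.mem_get_info` requires a reasonably recent PyTorch):

```python
import torch

def num_cacheable_rows(feat_dim, dtype=torch.float32, mem_fraction=0.5, device=0):
    # Reserve a fixed fraction of the currently free GPU memory for the cache
    # and convert that budget into a number of feature rows.
    free_bytes, _total = torch.cuda.mem_get_info(device)
    budget = int(free_bytes * mem_fraction)
    bytes_per_row = feat_dim * torch.tensor([], dtype=dtype).element_size()
    return budget // bytes_per_row
```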