Caching node features in GPU

molamooo commented 2 years ago

🚀 Feature

Hi, this is researcher from SJTU. We hope to contribute DGL with the ability to cache node features in GPU when it's impossible to hold all features.

Motivation

When using GPU for sampling-based mini-batch training, it's likely that a GPU can not hold all node features. In this case, DGL choose to store all features at CPU. Our recent work finds that, part of node features cached in GPU and cache policy properly chosen, the cost of copying feature from GPU to CPU can be significantly reduced, e.g. caching 20% features leads to 80% cache hit rate.

Alternatives

The simplest method is to cache features with smallest node id. An efficient yet simple method is to cache features of nodes with highest degree, as recent work PaGraph proposed. We step further by running an offline epoch of sample and then caching the mostly visited nodes.

The degree-based method is easier to contribute, since it only requires knowledge of the graph. Though the sample-based policy achieves greatest performance, it's not easy to design an elegant interface, since it requires knowledge of sample algorithm.

Pitch

We'd like to know how the community think about the necessity of cache? I've lookup to DGL's PR history and find previous efforts to extend DGL with cache but failed to achieve agreement on the design of semantics #316 . We still hope to take over the job and first contribute the ability of degree-based cache, as it requires less design work but boosts performance significantly. As for the sample-based method, we'd like to borrow the help of community to identify the proper way to design new modules and port our prototype to DGL, if the proposal can be accepted.

jermainewang commented 2 years ago

Hi @Jeffery-Song , thanks for the proposal. We will discuss it internally first and back to you asap. cc @VoVAllen

molamooo commented 2 years ago

Looking forward to your reply.

yaox12 commented 2 years ago

Recently, a distributed GNN framework built upon PyG called Quiver achieves an impressive speedup in feature coalescing leveraging the hierarchical caching strategy. So I believe this proposal is very helpful, especially in large-scale graphs, which is a common and urgent scenario in the industry.

Details of the caching strategy in Quive can be found here.

jermainewang commented 2 years ago

Hi @Jeffery-Song , it's been a long wait. We have just published an RFC (#3600) that shall be quite relevant to what you've proposed here. The idea of the RFC is to define the storage interface for graph structure and features that DGL will comply with. You may want to have a look at the feature storage interface where the caching mechanism can be built under the hood. By doing so, we hope to make the caching mechanism compatible with all kinds of sampling algorithms (not only neighbor sampling). Feel free to leave your feedback of the RFC.

Here are some of my further questions regarding to the caching policy.

What type of hardware hierarchy are you targeting? Is it CPU->GPU or Disk->CPU or multiple hierarchies?
As some of the latest papers showing the effectiveness of using CUDA UVA, do you think caching is still necessary if all the feature data can be stored in UVA?
Is there an eviction policy? Currently I only see proposals about what feature data to be included.
How to handle read-write data (e.g., when the model contains node embedding that needs to be updated)?

molamooo commented 2 years ago

Glad to see the proposal is making progress! I must say that my experience on GNN is limited, so feel free to challenge these insights.

I’ve only studied the scenarios of caching host features in gpu. But this should be up to DGL’s own pace, if DGL is supporting feature size exceeding host memory.
About UVA. In my understanding, UVA should be similar to paging in OS. Application with spatial and temporal locality. However, in GNN applications, access to node feature should be quite random. It is likely that only minority of features in same page is accessed, and frequent page fault can easily degrade performance. It requires non-trivial efforts to efficiently leverage UVA, e.g. grouping hot nodes to same page, and schedule accesses to UVA-managed memory carefully. On the other hand, a self-managed cache should be more effective without introducing much complexity.

Also, there seems do not exist a generally accepted way to leverage UVA effectively, while I’ve seen degree-based static cache adopted many times(even as a baseline).
Personally I expect that fine-grained eviction should be unnecessary, as accessed feature are not likely to be re-accessed in a short term, even for hot nodes(a micro-benchmark may be required to prove this?). It is likely that an eviction-enabled implementation leads to similar performance as UVA.

However, eviction at coarse-grained should be effective, e.g. rebuild cache based on previous epoch’s access frequency. Our prior experiment already show that caching hottest nodes of the first epoch can significantly improve cache hit rate in later epochs.
I’m not sure if I’m understanding rw data correctly. My original idea is to cache features associated with graph dataset, which is commonly static(I haven’t encountered use cases where feature of input graph dataset is updated). If such case does exist, a naive method is to update caches as-long-as the original feature is updated. Since the FeatureStorage in the RFC should be providing a copy of extracted feature, I think a dedicated api should be added to allow modification to the original feature? Meanwhile, caches should be maintained inside this api.

jermainewang commented 2 years ago

Thanks for the points. They are really helpful for me to understand the scope of the proposal. See if my summary is accurate.

We assume users are able to load the entire features into host memory and the caching mechanism will decide what to be cached on GPU for training.
We will first focus on static caching -- selectively caching hot features by analyzing the graph structure before training. The cost of analysis could be amortized if the training takes a long time.
We may support a limited form of refreshing cached contents. That can be either triggered explicitly (by calling some APIs after each epoch) or implicitly (by detecting the access distribution drift in a certain period of time). In either way, we expect the frequency of cache refreshing to be low.
The caching mechanism will not be affected if the original features are stored in UVA. The only(?) difference is the copy operation used when there is a cache miss. If the original features are stored in regular CPU RAM, we will first slice out the required data on CPU and issue a CUDAMemCopyH2D to fetch it to GPU. If the features are stored in UVA, we will directly issue a CUDA kernel to fetch the data (similar to what @davidmin7 ) has done for his UVA feature tensor.
Asynchronous I/O will be handled by DGL dataloader logic so the caching mechanism is unaware of that.
We will first focus on read-only data. Support of rw data needs more clarification on the possible access pattern and may belong to a new RFC.

cc @davidmin7 @nv-dlasalle @BarclayII

molamooo commented 2 years ago

👏🏻Neat summary! I assume these should cover most use cases.

nv-dlasalle commented 2 years ago

I think this is a good fit for the customized FeatureStorage in #3600.

davidmin7 commented 2 years ago

I've read both the new FeatureStroage proposal and the caching idea and they look very exciting. If DGL can systematically support the new data loading pipeline combined with UVA and caching, that could surely benefit majority of the framework users.

However, I just want to add some clarification here. It is difficult to use the UVA with page migration (not the zero-copy access method) in the multiprocessing environment because the memory allocated with cudaMallocManaged() cannot be shared with different processes. That's why I proposed the combinations of the shared memory + cudaHostRegister() in my previous works. The memory pinned with cudaHostRegister() can be only used as a zero-copy access memory, and not for the page migration.

jermainewang commented 2 years ago

@Jeffery-Song I think for the next step, we should have an agreement on the actual implementation plan. Since we are in the same timezone, how about having a short discussion between two of us?

molamooo commented 2 years ago

@Jeffery-Song I think for the next step, we should have an agreement on the actual implementation plan. Since we are in the same timezone, how about having a short discussion between two of us?

Sure! Where do me discuss? WeChat, online-meeting or email?

jermainewang commented 2 years ago

Had an offline discussion with @Jeffery-Song. The conclusion is that Jeffery will try develop a prototype of CachedFeatureStorage which is a wrapper class over any other FeatureStorage object and provides caching mechanism internally. Est. time for the prototype is by the end of the month.

molamooo commented 2 years ago

I've opened a PR #3634 which currently only contains doc-string and basic description of the procedure. Now I have several concerns.

Not sure where to put the file. I only found a draft pr related to the new sampling pipeline. Maybe I just continue development and move it to the correct location once everything else is stable?
Currently I'll make async_fetch simply calls synchronous fetching. Same reason as the first point.
Current design is only targeting a static, degree-based cache. Previously we've discussed about support caching hot features according to footprint of previous fetching in the future. Is it better to support both method in this same storage class, or separately? Or is it up to me to give an initial design?
The decision of cache ratio is not trivial, since this can be affected by the model, source feature dimension, and various gpu devices of users. The easiest way is to set cache space at a fixed rate(e.g. 50%) of device memory, then calculates back the cache ratio according to feature's dimension. Another way is to start at a low cache rate, and provide an API to enlarge the cache rate after several epochs. Any advices?

jermainewang commented 2 years ago

I've opened a PR #3634 which currently only contains doc-string and basic description of the procedure. Now I have several concerns.

Not sure where to put the file. I only found a draft pr related to the new sampling pipeline. Maybe I just continue development and move it to the correct location once everything else is stable?

I will suggest so too.

Currently I'll make async_fetch simply calls synchronous fetching. Same reason as the first point.

You could ignore async_fetch at the moment as we find it may need more clarification from actual use cases.

Current design is only targeting a static, degree-based cache. Previously we've discussed about support caching hot features according to footprint of previous fetching in the future. Is it better to support both method in this same storage class, or separately? Or is it up to me to give an initial design?

I'm good with either as long as you could demonstrate the effectiveness of your cache strategy on some important benchmarks.

The decision of cache ratio is not trivial, since this can be affected by the model, source feature dimension, and various gpu devices of users. The easiest way is to set cache space at a fixed rate(e.g. 50%) of device memory, then calculates back the cache ratio according to feature's dimension. Another way is to start at a low cache rate, and provide an API to enlarge the cache rate after several epochs. Any advices?

How about letting users set an (initial) cache rate as well as a strategy argument? The options are "static" or whatever strategies you feel would be most effective in the first version.

dmlc / dgl