Hi @Jeffery-Song , thanks for the proposal. We will discuss it internally first and get back to you asap. cc @VoVAllen
Looking forward to your reply.
Recently, a distributed GNN framework built upon PyG called Quiver achieved an impressive speedup in feature coalescing by leveraging a hierarchical caching strategy. So I believe this proposal is very helpful, especially for large-scale graphs, which are a common and urgent scenario in industry.
Details of the caching strategy in Quiver can be found here.
Hi @Jeffery-Song , it's been a long wait. We have just published an RFC (#3600) that should be quite relevant to what you've proposed here. The idea of the RFC is to define the storage interface for graph structure and features that DGL will comply with. You may want to have a look at the feature storage interface, where the caching mechanism can be built under the hood. By doing so, we hope to make the caching mechanism compatible with all kinds of sampling algorithms (not only neighbor sampling). Feel free to leave your feedback on the RFC.
Here are some of my further questions regarding the caching policy.
Glad to see the proposal is making progress! I must say that my experience with GNNs is limited, so feel free to challenge these insights.
I’ve only studied the scenario of caching host features in GPU memory. Whether DGL should support feature sizes exceeding host memory is up to DGL’s own pace.
About UVA: in my understanding, UVA is similar to paging in an OS, which works well for applications with spatial and temporal locality. In GNN applications, however, access to node features is quite random. It is likely that only a minority of the features in a given page are accessed, and frequent page faults can easily degrade performance. Leveraging UVA efficiently requires non-trivial effort, e.g. grouping hot nodes into the same page and carefully scheduling accesses to UVA-managed memory. On the other hand, a self-managed cache should be more effective without introducing much complexity.
Also, there does not seem to be a generally accepted way to leverage UVA effectively, whereas I’ve seen degree-based static caches adopted many times (even as a baseline).
Personally I expect that fine-grained eviction is unnecessary, since an accessed feature is unlikely to be re-accessed in the short term, even for hot nodes (a micro-benchmark may be required to prove this?). It is likely that an eviction-enabled implementation would perform similarly to UVA.
However, eviction at a coarse granularity should be effective, e.g. rebuilding the cache based on the previous epoch’s access frequency. Our prior experiments already show that caching the hottest nodes of the first epoch can significantly improve the cache hit rate in later epochs.
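To make the coarse-grained idea concrete, here is a minimal sketch of counting accesses during one epoch and rebuilding the cache from the hottest nodes before the next one; all names are illustrative and not an existing DGL API:

```python
import torch

# Sketch of the coarse-grained policy: count accesses during one epoch,
# then rebuild the GPU cache from the hottest nodes before the next epoch.
cpu_feats = torch.randn(1_000_000, 128)                 # full features on host
access_count = torch.zeros(cpu_feats.shape[0], dtype=torch.long)

def record(ids):                                        # call per fetched mini-batch
    access_count[ids] += 1

def rebuild_cache(capacity, device="cuda"):
    hot_ids = torch.topk(access_count, capacity).indices
    gpu_cache = cpu_feats[hot_ids].to(device)           # hottest rows move to GPU
    access_count.zero_()                                 # start fresh for next epoch
    return hot_ids, gpu_cache
```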
I’m not sure if I’m understanding the read/write (rw) data case correctly. My original idea is to cache the features associated with the graph dataset, which are commonly static (I haven’t encountered use cases where the features of the input graph dataset are updated). If such a case does exist, a naive method is to update the cache whenever the original feature is updated. Since the FeatureStorage in the RFC should be providing a copy of the extracted features, I think a dedicated API should be added to allow modification of the original features, and the cache should be maintained inside this API.
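To illustrate what I mean by maintaining the cache inside such an API, here is a hypothetical write-through sketch; none of these names come from the RFC, they only show the idea:

```python
import torch

# Hypothetical write-through path for the rare read/write case: write the new
# values into the original (host) feature tensor and refresh any GPU-cached
# copies in the same call, so reads never see stale rows.
def update_features(cpu_feats, gpu_cache, slot_of, ids, new_values, device="cuda"):
    """cpu_feats: full features on host; gpu_cache: cached rows on device;
    slot_of: CPU LongTensor mapping node id -> cache slot (-1 if uncached)."""
    cpu_feats[ids] = new_values            # 1. update the backing storage
    slots = slot_of[ids]
    hit = slots >= 0
    if hit.any():                          # 2. keep cached rows consistent
        gpu_cache[slots[hit]] = new_values[hit].to(device)
```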
Thanks for the points. They are really helpful for understanding the scope of the proposal. Please see if my summary is accurate.
cc @davidmin7 @nv-dlasalle @BarclayII
👏🏻Neat summary! I assume these should cover most use cases.
I think this is a good fit for the customized FeatureStorage in #3600.
I've read both the new FeatureStorage proposal and the caching idea, and they look very exciting. If DGL can systematically support the new data loading pipeline combined with UVA and caching, that could surely benefit the majority of the framework's users.
However, I just want to add some clarification here. It is difficult to use UVA with page migration (as opposed to the zero-copy access method) in a multiprocessing environment, because memory allocated with cudaMallocManaged() cannot be shared across processes. That's why I proposed the combination of shared memory + cudaHostRegister() in my previous works. Memory pinned with cudaHostRegister() can only be used as zero-copy access memory, not for page migration.
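For anyone following along, the shared-memory + cudaHostRegister() pattern looks roughly like this in PyTorch, assuming torch.cuda.cudart() exposes cudaHostRegister (it does in recent versions); this is only a sketch of the idea, not the exact code from the cited works:

```python
import torch

# Sketch: place the full feature tensor in shared host memory so every
# worker process can see it, then pin it in place with cudaHostRegister()
# so GPU kernels can read it via zero-copy access (no page migration).
feats = torch.randn(1_000_000, 128)
feats.share_memory_()                     # shareable across processes

cudart = torch.cuda.cudart()              # thin wrapper over the CUDA runtime
# 0 == cudaHostRegisterDefault; pins the existing pages without copying them
cudart.cudaHostRegister(
    feats.data_ptr(), feats.numel() * feats.element_size(), 0)
assert feats.is_pinned()                  # zero-copy access is now possible
```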
@Jeffery-Song I think for the next step, we should have an agreement on the actual implementation plan. Since we are in the same timezone, how about having a short discussion between two of us?
Sure! Where do we discuss? WeChat, online meeting, or email?
Had an offline discussion with @Jeffery-Song. The conclusion is that Jeffery will try to develop a prototype of CachedFeatureStorage, which is a wrapper class over any other FeatureStorage object and provides a caching mechanism internally. Estimated time for the prototype is by the end of the month.
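For reference, a rough sketch of what such a wrapper could look like, assuming the wrapped FeatureStorage exposes a fetch(ids, device)-style method; all names here are placeholders rather than the final interface:

```python
import torch

# Placeholder sketch of a CachedFeatureStorage-style wrapper: hot rows are
# materialized on the GPU once (static cache), misses fall back to the
# wrapped storage. Not the final interface, just the general shape.
class CachedFeatureStorage:
    def __init__(self, storage, cached_ids, num_nodes, feat_dim, device="cuda"):
        self.storage = storage
        self.device = device
        self.feat_dim = feat_dim
        self.cache = storage.fetch(cached_ids, device)        # hot rows on GPU
        # global node id -> slot in the cache, -1 means "not cached"
        self.slot_of = torch.full((num_nodes,), -1, dtype=torch.long, device=device)
        self.slot_of[cached_ids.to(device)] = torch.arange(
            cached_ids.shape[0], device=device)

    def fetch(self, ids, device):
        ids = ids.to(self.device)
        slots = self.slot_of[ids]
        hit = slots >= 0
        out = torch.empty((ids.shape[0], self.feat_dim),
                          dtype=self.cache.dtype, device=self.device)
        out[hit] = self.cache[slots[hit]]
        # cache misses are served by the wrapped (e.g. CPU tensor) storage
        out[~hit] = self.storage.fetch(ids[~hit].cpu(), self.device)
        return out.to(device)
```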
I've opened PR #3634, which currently only contains docstrings and a basic description of the procedure. Now I have several concerns.
- Not sure where to put the file. I only found a draft PR related to the new sampling pipeline. Maybe I should just continue development and move it to the correct location once everything else is stable?
I would suggest so too.
- Currently I'll make `async_fetch` simply call synchronous fetching. Same reason as the first point.
You could ignore `async_fetch` at the moment, as we find it may need more clarification from actual use cases.
- The current design only targets a static, degree-based cache. Previously we've discussed supporting, in the future, caching hot features according to the footprint of previous fetches. Is it better to support both methods in the same storage class, or separately? Or is it up to me to give an initial design?
I'm good with either, as long as you can demonstrate the effectiveness of your cache strategy on some important benchmarks.
- Deciding the cache ratio is not trivial, since it can be affected by the model, the source feature dimension, and users' various GPU devices. The easiest way is to set the cache space to a fixed rate (e.g. 50%) of device memory, then calculate the cache ratio back from the feature dimension. Another way is to start at a low cache rate and provide an API to enlarge the cache rate after several epochs. Any advice?
How about letting users set an (initial) cache rate as well as a strategy argument? The options are "static" or whatever strategies you feel would be most effective in the first version.
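To sketch the first option (a fixed fraction of device memory), something like the following could derive the capacity; the 50% budget, the helper name, and the use of torch.cuda.mem_get_info are all assumptions for illustration:

```python
import torch

# Illustrative helper: reserve a fixed fraction of the currently free GPU
# memory for the cache and derive how many feature rows fit into that budget.
def cache_capacity(feat_dim, feat_dtype=torch.float32,
                   budget_fraction=0.5, device=0):
    free_bytes, _total = torch.cuda.mem_get_info(device)
    bytes_per_row = feat_dim * torch.empty((), dtype=feat_dtype).element_size()
    return int(free_bytes * budget_fraction) // bytes_per_row

# e.g. how many 128-dim float32 rows fit into half of the free device memory
num_cached = cache_capacity(feat_dim=128)
```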
🚀 Feature
Hi, this is a researcher from SJTU. We hope to contribute to DGL the ability to cache node features in GPU memory when it is impossible to hold all features there.
Motivation
When using the GPU for sampling-based mini-batch training, it is likely that the GPU cannot hold all node features. In this case, DGL chooses to store all features in CPU memory. Our recent work finds that, with part of the node features cached in GPU memory and the cache policy properly chosen, the cost of copying features from CPU to GPU can be significantly reduced, e.g. caching 20% of the features leads to an 80% cache hit rate.
Alternatives
The simplest method is to cache the features with the smallest node IDs. An efficient yet simple method is to cache the features of the nodes with the highest degrees, as the recent work PaGraph proposed. We go a step further by running an offline sampling epoch and then caching the most frequently visited nodes.
The degree-based method is easier to contribute, since it only requires knowledge of the graph. Though the sample-based policy achieves the best performance, it is not easy to design an elegant interface for it, since it requires knowledge of the sampling algorithm.
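For concreteness, degree-based selection can be as simple as the following sketch (the function name and the choice of in-degree are illustrative):

```python
import torch
import dgl

# Sketch of the degree-based policy: cache the features of the nodes with
# the highest in-degrees, since they are the most likely to be sampled.
def degree_based_cache_ids(g: dgl.DGLGraph, capacity: int) -> torch.Tensor:
    return torch.topk(g.in_degrees(), capacity).indices
```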
Pitch
We'd like to know what the community thinks about the necessity of a cache. I've looked through DGL's PR history and found a previous effort to extend DGL with a cache, which failed to reach agreement on the design of its semantics (#316). We still hope to take on the job and first contribute the degree-based cache, as it requires less design work but boosts performance significantly. As for the sample-based method, we'd like the community's help in identifying the proper way to design the new modules and port our prototype to DGL, if this proposal is accepted.