CGCL-codes / Frog

Frog is an asynchronous graph processing framework on GPU with a hybrid coloring model. The fundamental idea is based on the Pareto principle (the 80-20 rule) applied to graph coloring, which we observed across a large number of real graph coloring cases.
http://cgcl.grid.hust.edu.cn/xhshi/projects/frog.html
GNU General Public License v2.0

Twitter dataset runs out of memory #6

Open · sgpyc opened this issue 7 years ago

sgpyc commented 7 years ago

Hi, the Frog technical report indicates that it can process the twitter graph (41.7M vertices and 1.47B edges) on K20m with 6GB of memory.

However, when I tested it on a K40c with 12 GB of memory, it ran out of memory. Looking into bfs.cu, lines 180 to 181, the pair of CudaBufferFill calls essentially allocates 2 ints (4 bytes each) for each edge in the graph and moves the source and destination vertex indices from CPU to GPU.

Here is my question: 1.47 B edges * 2 vertices/edge * 4 bytes/vertex = 11.76 x 10^9 bytes. How can that fit into the K20m's 6 GB of memory without streaming? It is even larger than the K40c's usable memory, which is 11.439 x 10^9 bytes.
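To make the arithmetic concrete, here is a minimal footprint check (just a sketch based on the 2-ints-per-edge layout I inferred from bfs.cu, not code taken from Frog itself; the constants are the ones quoted in this thread):

#include <stdio.h>

int main(void) {
    long long edges = 1468365181LL;                /* twitter_rv edge count */
    long long bytes_per_edge = 2LL * sizeof(int);  /* source + destination index */
    double gbytes = (double)(edges * bytes_per_edge) / 1e9;
    printf("edge buffers need %.2f x 10^9 bytes\n", gbytes);  /* ~11.75, well above the K20m's 6 GB */
    return 0;
}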

~/Projects/Frog/src/exp ./twitter_rv.net.bin
Reading File ... 46193.47 ms 
Begin Experiments on Graph (V=41652230 E=1468365181 File='./twitter_rv.net.bin')
-------------------------------------------------------------------
Partitioning ... 18949.75 ms ... Get partitions ... 12962.69 ms
    Time    Total   Tips
    36374.11    BFS on CPU  Step=14 Visited=35016137
GPU Memory Allocation Failed in File 'bfs.cu' at Line 180!
    INFO : out of memory
AndrewStallman commented 7 years ago

The method for processing a graph that fits in GPU memory and that for a large-scale graph that cannot fit in GPU memory are a little different. For a large-scale graph, the process described in our technical report is as follows:

For a large-scale graph, one of the partitions may contain too many vertices for the GPU memory to hold them all at a time. For the first four partitions, we can simply divide each of them into subsets based on the memory size, because the vertices in each partition share the same color; see the sketch below. For the last, hybrid partition, we re-process it by coloring its vertices again: we treat the sub-graph of the hybrid partition as a new graph and apply our relaxed coloring algorithm, which guarantees that every partition it generates fits in GPU memory.
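As an illustration of the subset idea (a minimal sketch only; the Edge struct, process_edges and process_partition are illustrative names, not the actual Frog data structures or kernels):

#include <cuda_runtime.h>
#include <algorithm>

struct Edge { int src; int dst; };   /* illustrative edge record */

__global__ void process_edges(const Edge *edges, long long n) {
    long long i = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (i < n) {
        /* per-edge work of the current algorithm (BFS/PageRank/CC/SSSP) goes here */
    }
}

/* Process one single-color partition whose edge list does not fit in GPU memory:
   split it into chunks sized by the memory budget and process chunk by chunk. */
void process_partition(const Edge *h_edges, long long total_edges,
                       long long mem_budget_bytes) {
    long long chunk_edges = mem_budget_bytes / (long long)sizeof(Edge);
    Edge *d_edges = NULL;
    cudaMalloc(&d_edges, chunk_edges * sizeof(Edge));
    for (long long off = 0; off < total_edges; off += chunk_edges) {
        long long n = std::min(chunk_edges, total_edges - off);
        cudaMemcpy(d_edges, h_edges + off, n * sizeof(Edge), cudaMemcpyHostToDevice);
        process_edges<<<(unsigned)((n + 255) / 256), 256>>>(d_edges, n);
        cudaDeviceSynchronize();
    }
    cudaFree(d_edges);
}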

The current version of the four algorithms is intended for graphs that fit in GPU memory. For large-scale graph processing, please see https://github.com/AndrewStallman/Frog/tree/current/src/big-data-process .

How to use the code to process the twitter graph?

We also find that the size parameter affects performance; we are still dealing with that problem and the code is unstable, which is why we had not pushed this part (https://github.com/AndrewStallman/Frog/tree/current/src/big-data-process ) to GitHub before.

sgpyc commented 7 years ago

Thanks for uploading the code for large graphs and the procedure for running it. It compiles and runs on our K40.

I tried different sizes, and it seems a larger size (the maximum number of edges stored on the GPU at a time) yields better performance, provided the program doesn't run out of memory. For the twitter_rv dataset on the K40, the largest size I tried is 0.9 billion edges (using 10.8 GB of GPU memory to store edge info for SSSP; 1 billion edges runs out of memory). The results are as follows:
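For reference, the 10.8 GB figure is consistent with about 12 bytes per edge on the device (my guess: three 4-byte ints per edge for SSSP, e.g. source, destination and weight; I have not verified the actual layout):

#include <stdio.h>

int main(void) {
    long long chunk_edges = 900000000LL;   /* the largest size that worked for me */
    long long bytes_per_edge = 12LL;       /* assumed: 3 x 4-byte ints per edge */
    printf("%.1f GB per chunk\n", chunk_edges * bytes_per_edge / 1e9);  /* 10.8 GB */
    return 0;
}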

Projects/Frog/src/exp twitter_rv.net.bin 900000000
Reading File ... 3893.83 ms 
Begin Experiments on Graph (V=41652230 E=1468365181 File='twitter_rv.net.bin')
Partitioning ... 22475.36 ms ... Get partitions ... 8529.45 ms
    Time    Total   Tips
    1612.41 46284.23    part_edge_loop  step=14 
-------------------------------------------------------------------
Partitioning ... 25213.76 ms ... Get partitions ... 8536.37 ms
    Time    Total   Tips
    80141.21    35806.38    pagerank_edge_loop  step=10 

-------------------------------------------------------------------
Partitioning ... 24251.28 ms ... Get partitions ... 8859.18 ms
    Time    Total   Tips
    5571.02 28513.75    part_cc_edge_loop   step=7  
-------------------------------------------------------------------
Partitioning ... 22693.47 ms ... Get partitions ... 8615.90 ms
    Time    Total   Tips
    5472.23 39894.22    sssp_edge_part_loop step=6  
Done Experiments

Since the approach uses streaming, I would use the total time (compute + data movement) for comparison, instead of the compute time alone. By the way, the total time and the compute time reported by PR appear to be swapped.
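For what it's worth, this is the kind of event-based timing I would use to separate data movement from compute in a streamed run (a generic CUDA sketch with a dummy kernel, not Frog's actual timing code):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void dummy_kernel(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;   /* stands in for the per-chunk compute step */
}

int main(void) {
    const int n = 1 << 20;
    int *h = (int *)calloc(n, sizeof(int));
    int *d = NULL;
    cudaMalloc(&d, n * sizeof(int));

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

    cudaEventRecord(t0);
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);   /* data movement */
    cudaEventRecord(t1);
    dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);                /* compute */
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float move = 0, comp = 0, total = 0;
    cudaEventElapsedTime(&move, t0, t1);
    cudaEventElapsedTime(&comp, t1, t2);
    cudaEventElapsedTime(&total, t0, t2);
    printf("transfer %.2f ms, compute %.2f ms, total %.2f ms\n", move, comp, total);

    cudaFree(d);
    free(h);
    return 0;
}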

Let me know if there is anything unexpected in the results.

AndrewStallman commented 7 years ago

Thanks for helping us find the mistake in the time values output by PR.cu; we have revised it. Most graph processing systems compare compute time with other systems in their published papers, and we did the same in ours. Actually, we do not use streaming for all data movement: the total time includes the compute time, the data movement done with the streaming approach, and the data movement done without streaming.