frankmcsherry / pagerank

Implementation of PageRank in timely dataflow

Figures on very big graphs? #8

Open antoine-de opened 6 years ago

antoine-de commented 6 years ago

Hi,

This project seems very interesting! I was wondering whether you had tested it on quite big graphs, like the host-level Common Crawl graph with tens (or hundreds) of billions of edges? Do you have any figures on what cluster would be required to run it, and a rough idea of the time needed to compute PageRank on it?

I'm sorry, I haven't yet delved deeply enough into timely, but would harmonic centrality be as easy to implement as PageRank?

frankmcsherry commented 6 years ago

I think this code has only been used on up to 4 billion edges (with the UK-2007 graph), but nothing prevents running it on a larger graph. The main requirement is enough available memory or swap for the partitioned edge file, even though it is mostly traversed linearly.
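As a rough back-of-envelope estimate (assuming edges are stored as two u32 endpoints, i.e. 8 bytes per edge; the actual on-disk layout may differ): 4 billion edges comes to about 32 GB, and a 64-billion-edge graph to about 512 GB, which is what makes partitioning the file across machines attractive.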

We have processed the 2014 CC graph (64 billion edges) on timely with a different analysis (subgraph finding) over at: https://github.com/frankmcsherry/dataflow-join

I'm afraid I only know the definition of harmonic centrality, not efficient algorithms for computing it. Generally, things like betweenness centrality do not have linear-time algorithms, which is bad news on massive graphs. Approximations to betweenness centrality, like counting only subsets of distances, can execute efficiently, but you need to determine to what extent the approximation works for you.
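For concreteness, here is that definition turned into code: a minimal single-source BFS sketch in plain Rust (not timely dataflow; the function name `harmonic_from` and the adjacency-list representation are illustrative, not from this repo). It also makes the cost issue visible, since exact scores for every node repeat this BFS once per node.

```rust
use std::collections::VecDeque;

// Single-source harmonic contribution: the sum of 1/d(src, v) over all
// nodes v reachable from src. Conventions vary (some definitions sum
// over distances *to* a node); this is just the definition as code.
fn harmonic_from(adj: &[Vec<usize>], src: usize) -> f64 {
    let mut dist = vec![usize::MAX; adj.len()];
    let mut queue = VecDeque::new();
    dist[src] = 0;
    queue.push_back(src);
    let mut score = 0.0;
    while let Some(u) = queue.pop_front() {
        for &v in &adj[u] {
            if dist[v] == usize::MAX {
                dist[v] = dist[u] + 1;
                score += 1.0 / dist[v] as f64;
                queue.push_back(v);
            }
        }
    }
    score
}

fn main() {
    // Tiny directed graph: 0 -> 1, 1 -> 2, 0 -> 2.
    let adj = vec![vec![1, 2], vec![2], vec![]];
    // Exact computation needs one BFS per node, i.e. O(V * (V + E))
    // total, which is infeasible at billions of edges without sampling
    // or sketching (e.g. HyperBall-style approximate counters).
    for src in 0..adj.len() {
        println!("H({}) = {}", src, harmonic_from(&adj, src));
    }
}
```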

frankmcsherry commented 6 years ago

And, if it helps, you can read more about performance on 1.6 billion edge and 3.7 billion edge graphs at:

https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-08.md
https://github.com/frankmcsherry/blog/blob/master/posts/2015-07-31.md

The short answer: twenty iterations on the 1.6B-edge graph took about 20 seconds, using a cluster of 16 machines. There are scaling numbers in those posts, and the times should scale roughly linearly with the number of edges (perhaps degrading with the number of nodes, as the random-access working set grows).
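To make the working-set point concrete, here is a minimal sequential sketch of one synchronous PageRank iteration (an illustration, not the timely dataflow code in this repo; it also ignores dangling nodes for brevity). The edge scan is sequential, but the rank reads and writes are not:

```rust
// One synchronous PageRank iteration over an edge list. The scan over
// `edges` is linear, but reads of ranks[src] and writes to next[dst]
// land at effectively random offsets, so the random-access working set
// grows with the number of nodes.
fn pagerank_iter(edges: &[(usize, usize)], deg: &[usize], ranks: &[f64], alpha: f64) -> Vec<f64> {
    let n = ranks.len();
    // Uniform teleport mass, then damped contributions along edges.
    let mut next = vec![(1.0 - alpha) / n as f64; n];
    for &(src, dst) in edges {
        next[dst] += alpha * ranks[src] / deg[src] as f64;
    }
    next
}

fn main() {
    // Tiny graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
    let edges = [(0, 1), (0, 2), (1, 2), (2, 0)];
    let deg = [2, 1, 1]; // out-degrees of nodes 0, 1, 2
    let mut ranks = vec![1.0 / 3.0; 3];
    for _ in 0..20 {
        ranks = pagerank_iter(&edges, &deg, &ranks, 0.85);
    }
    println!("ranks after 20 iterations: {:?}", ranks);
}
```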

antoine-de commented 6 years ago

Great, I'll try to run it on a big graph, thanks!

I noticed that node identifiers are u32; I'll need to change them to u64 to handle 4B+ nodes.

frankmcsherry commented 6 years ago

Sounds good; the 128B Common Crawl graph has identifiers that still fit in a u32, if that helps, but if you have your own larger graph you may need to modify the code.
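For reference, the change is small in code terms. A hedged sketch (the `Node` alias is illustrative; the repo may organize its types differently):

```rust
// Widening the identifier alias from u32 to u64 lifts the cap of
// roughly 4.29 billion distinct nodes, at the cost of doubling the
// memory spent on every stored identifier (and every edge endpoint).
type Node = u64; // was: type Node = u32;

fn main() {
    let five_billion: Node = 5_000_000_000; // would overflow a u32
    println!("largest representable id: {}", Node::MAX);
    println!("example id: {}", five_billion);
}
```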


antoine-de commented 6 years ago

Yes, I'm using a CC dataset with 5B nodes :confused: