VHRanger / CSRGraph

A tiny library for large graphs
MIT License
111 stars 17 forks source link

How to map random walk result back to node name? #6

Closed SnoringinSeattle closed 3 years ago

SnoringinSeattle commented 3 years ago

Hi, thanks a lot for creating this library, it's been very helpful to graph embedding in python! I have a quite large graph, with 200k nodes and 600m edges, thus the random walk sampling step of node2vec will take quite some time. Therefore I am using this package for the random walk piece and then save the random walk and use gensim for word2vec part directly. The random walk part result seems largely falls in range(0, number_of_nodes), with a few hundred exceptions. I noticed "3419130827" appeared a few times, since it is out of range, does that just mean the random walk is at a dead end? I am also curious how to map 0,1,2,3 ... (ie a random walk result [0,57,235 ...]) back to node names. Thanks!

SnoringinSeattle commented 3 years ago

I did a simple len = 2 walk with 25 nodes, and the result looks like this. Can't wrap my head around how to interpretate it:

array([[ 8, 3419130827], [ 2, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [ 18, 3419130827], [ 2, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 3419130827], [1530536548, 0], [1530536548, 0], [1530536548, 10], [1530536548, 0], [1530536548, 0], [1530536548, 3], [1530536548, 32722], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [ 21, 32719], [ 2, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [ 4, 32719], [ 2, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [ 23, 32719], [ 2, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [ 15, 32719], [ 2, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 1], [1530536548, 32719], [1530536548, 32720], [1530536548, 32719], [1530536548, 32720], [1530536548, 32719], [1530536548, 32720], [ 8, 32719], [ 2, 32720], [1530536548, 32719], [1530536548, 32721], [1530536548, 32719], [1530536548, 32720], [1530536548, 32719], [1530536548, 32720], [1530536548, 0], [1530536548, 0], [1530536548, 32719], [1530536548, 32719], [1530536548, 0], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 0], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 0], [1530536548, 0], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 0], [ 24, 32719], [ 2, 32719], [1530536548, 0], [1530536548, 0], [1530536548, 0], [1530536548, 0], [1530536548, 32719], [1530536548, 32719], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [ 15, 32719], [ 2, 0], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 32719], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [ 19, 4294967295], [ 2, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 32719], [1530536548, 4294967295], [1530536548, 32719], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 32719], [1530536548, 0], [1530536548, 4294967295], [1530536548, 0]], dtype=uint32)

VHRanger commented 3 years ago

Hey, so a couple of points:

  1. There's the keep_walks=True option you can set directly from the model = nodevectors.Node2Vec(keep_walks=True) class. You can then access it with model.walks.

  2. For huge graphs, I recommend trying GGVec and ProNE before Node2Vec.

  3. The walks you're seeing in the g.random_walks() method are the node IDs rather than the node names (they can be different even if the node names are numbers). The fastest way to map it back to node names is by making it a pd.DataFrame and using df.map(g.names). Something like walks = pd.DataFrame(walks).map(g.names) (what's done in the node2vec implementation.

That said I'm concerned about the huge numbers in your results, you didn't seem to imply you had that many nodes.

SnoringinSeattle commented 3 years ago

Thanks for the quick response! And thanks for creating this package!

Re: That said I'm concerned about the huge numbers in your results, you didn't seem to imply you had that many nodes. Yes this is quite confusing to me. While running random walk using the full set(~200k nodes), I got a few thousand nodes that's out of range. I later tried a smaller sample with only ~25 nodes, and still see the large numbers. I originally thought it's a place holder for early terminated walks but that seems not the case. Have you seen this happening before? And suggestion on how to debug here? A related observation is I found when I load the 25 node graph, the csr.mat.data is a list of 25 1s (ie[1,1,1,.....1]). Could it be that the graph was not loaded correctly?

VHRanger commented 3 years ago

Have you seen this happening before?

If it was such a bug, normally it would segfault, because those node IDs we see are used as pointers into the other CSR matrix arrays.

If we have huge node IDs lying around, you'd be pointing to random places in memory and segfaulting.

I got a few thousand nodes that's out of range.

That's possible if some nodes are only "receivers" (they still need to be indexed) or in other similar edge cases.

csr.mat.data is a list of 25 1s (ie[1,1,1,.....1]).

That's normal. If you don't have edge weights, the data array is the edge weights which are just 1 for every edge

VHRanger commented 3 years ago

The segfault bug has been fixed in latest commit.