deezer / gravity_graph_autoencoders

Source code from the CIKM 2019 article "Gravity-Inspired Graph Autoencoders for Directed Link Prediction" by G. Salha, S. Limnios, R. Hennequin, V.A. Tran and M. Vazirgiannis

Details #9

Closed arita37 closed 2 years ago

arita37 commented 2 years ago

Hello,

Just one quick question on this: can the gravity-inspired AE algorithm scale to 10 million items, or even 100 million items?

A common issue is getting things to scale beyond what a local laptop can compute.

GuillaumeSalhaGalvan commented 2 years ago

Dear @arita37,

Thank you for your message.

We discuss scalability concerns in Section 3.6 of our paper. In essence, the code provided in this repository suffers from a quadratic complexity w.r.t. the number of nodes n in the graph. At each training iteration, we need to reconstruct entire n x n adjacency matrices to evaluate reconstruction losses or ELBO objectives. As a consequence, on your own laptop, you will likely get errors when trying to reconstruct graphs with (roughly) n > 40000 or 50000 nodes. Note: this issue is not specific to gravity-inspired models, e.g., you would encounter similar errors when running standard GAE and VGAE models using T.Kipf's original code, as it also reconstructs entire adjacency matrices.
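To make the quadratic cost concrete, here is a schematic numpy version of a gravity-inspired decoder in the spirit of the paper (my own simplification, not the repository's actual TensorFlow code): the probability of a directed edge i -> j is sigmoid(m_j - lambda * log ||z_i - z_j||^2), and reconstructing all edges at once materializes a dense n x n matrix.

```python
import numpy as np

# Schematic sketch (assumed simplification): decode ALL pairwise edge
# probabilities at once, which costs O(n^2) time and memory.
def gravity_decoder_full(Z, m, lam=1.0, eps=1e-8):
    sq = (Z ** 2).sum(axis=1)
    # All pairwise squared distances ||z_i - z_j||^2 as an n x n matrix
    dist2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    # Edge i -> j uses the "mass" of the target node j (hence m broadcast on columns)
    logits = m[None, :] - lam * np.log(np.maximum(dist2, eps))
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid, n x n edge probabilities

n, d = 1_000, 32
rng = np.random.default_rng(0)
A_hat = gravity_decoder_full(rng.normal(size=(n, d)), rng.normal(size=n))
print(A_hat.shape)  # (1000, 1000)

# Dense float32 storage alone for the reconstruction, in GB:
print(50_000 ** 2 * 4 / 1e9)  # 10.0
```

The second print shows why n > 40000-50000 nodes breaks on a laptop: the dense reconstruction alone needs about 10 GB at n = 50000.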

Do you need to train standard or gravity-inspired GAE/VGAE models on larger graphs?

Firstly, I would advise against adopting "naive" uniform mini-batch sampling strategies. In my experience, this often significantly deteriorates performance: intuitively, if you reconstruct 1% of the nodes/edges, drawn at random, you do not necessarily get a good summary of the initial graph structure. However, there exist more effective (yet simple) sampling strategies to scale GAE/VGAE to large graphs with millions of nodes and edges; you can take a look at our recent paper introducing the FastGAE method. I am currently finishing my Ph.D. thesis. After my defense (~Feb. 2022), I want to incorporate FastGAE into this repository to improve scalability. In the meantime, you can check the work of @fleverest, who recently combined these two methods in one of his own repositories.
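The FastGAE idea mentioned above can be sketched roughly as follows (my own toy simplification, not the official implementation): at each training step, decode only a sampled subgraph instead of the full n x n adjacency matrix.

```python
import numpy as np

# Hypothetical sketch: reconstruct only an |S| x |S| block for a sampled
# node subset S, so the decoding cost is quadratic in |S|, not in n.
def sampled_decoder(Z, sample_size, rng):
    n = Z.shape[0]
    S = rng.choice(n, size=sample_size, replace=False)  # sampled node subset
    Z_S = Z[S]
    logits = Z_S @ Z_S.T  # |S| x |S| instead of n x n
    return S, 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
Z = rng.normal(size=(100_000, 32)).astype(np.float32)  # n = 100k embeddings fit easily
S, A_S = sampled_decoder(Z, sample_size=2_000, rng=rng)
print(A_S.shape)  # (2000, 2000)
```

The embeddings for 100,000 nodes are cheap to hold in memory; only the full decoding was the bottleneck, and here each step touches a 2000 x 2000 block instead.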

Best regards,

Guillaume

arita37 commented 2 years ago

Thanks for the details.

Usually, we have around 40 million items (nodes), so scaling is the first priority; otherwise, the method cannot be tested meaningfully (a toy dataset does not mean much…).

Besides, do you think it can be integrated into an existing distributed GNN framework?

https://docs.dgl.ai/performance.html

Thanks


GuillaumeSalhaGalvan commented 2 years ago

I totally understand! At Deezer, most of our graphs also have millions of nodes, e.g., music artists or users.

Honestly, I am not very familiar with DGL. I didn't see any complete implementation of GAE/VGAE (maybe I missed it). However, the library already seems to contain all the building blocks (including GCN encoders, inner-product or embedding-distance computations, and cross-entropy losses) needed to integrate these models.

arita37 commented 2 years ago

Sure, thanks. Yes, the main challenge is usually scalability, especially for graph-structured data…

Is the updated version still quadratic in the number of nodes?

Here, one implementation of GAE in DGL: https://github.com/shionhonda/gae-dgl/blob/master/gae_dgl/gae.py


GuillaumeSalhaGalvan commented 2 years ago

The complexity of FastGAE depends on the size of the stochastic subgraphs drawn for decoding, which is a selectable parameter. For specific sizes motivated in the paper (e.g., subgraphs with O(√n) nodes, whose quadratic decoding then costs only O(n)), the overall complexity evolves linearly in n, while still reaching fairly good performance on link prediction and community detection tasks (see, e.g., Section 3.4 and the experiments for more details).
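As a small illustration of the non-uniform sampling discussed in the FastGAE paper (my own hedged simplification: the exact importance measures and hyperparameters are described in the paper), one variant draws nodes with probability proportional to degree raised to a power alpha, so structurally important nodes are decoded more often:

```python
import numpy as np

# Assumed sketch: node sampling probabilities proportional to degree^alpha.
# alpha = 0 recovers uniform sampling; larger alpha favors high-degree nodes.
def degree_sampling_probs(degrees, alpha=1.0):
    w = np.asarray(degrees, dtype=np.float64) ** alpha
    return w / w.sum()

degrees = [1, 1, 2, 10]  # toy degree sequence
p = degree_sampling_probs(degrees, alpha=1.0)
print(p)  # -> approximately [0.071, 0.071, 0.143, 0.714]

# These probabilities can feed directly into subgraph sampling:
rng = np.random.default_rng(0)
S = rng.choice(len(degrees), size=2, replace=False, p=p)
```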

GuillaumeSalhaGalvan commented 2 years ago

I'm closing this issue, but feel free to contact me if needed!

arita37 commented 2 years ago

Thanks for the various feedback.

For your information, DeepMind released a paper about a contrastive loss without negative sampling, which scales very well.
