SINr implementation to compute node embeddings from communities

nicolasdugue commented 8 months ago

As fans of the karateclub package, we propose to add the implementation of SINr, a graph embedding approach based on community detection to the package. SINr was first published in : Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021. pp.325-337, ⟨10.1007/978-3-030-74251-5_26⟩. ⟨hal-03197434⟩

It allows to produce interpretable node representations and can also compute word embeddings on textual graph data : Béranger, A., Dugué, N., Guillot, S., & Prouteau, T. (2023, November). Filtering communities in word co-occurrence networks to foster the emergence of meaning. In International Conference on Complex Networks and Their Applications (pp. 377-388).

A complete SINr implementation can be found here : https://github.com/SINr-Embeddings/sinr But the code we provide in this pull request is enough to train efficiently node embeddings using networkx and scipy. We would really appreciate our method to be included in the karateclub package that we use a lot !

LucaCappelletti94 commented 8 months ago

Hi @nicolasdugue! Thank you for your pull request.

First of all, if I am not mistaken, erase_base_features is not defined and is never used across the model, what is up with that? Same thing for workers.

Since this implementation as I understand it is primarily meant for didactical purposes, being it in NetworkX and all, could you kindly extend the comments and explain more extensively things such as:

What are the parameters for, especially erase_base_features and gamma and what do they do in your model? Are there optimal values? Is there any correlation between gamma and the number of nodes or the density of the graph (especially for gamma)?
How does the algorithm scale? What are the memory requirements for a graph with V nodes and E edges? Please provide both the computational and space complexities for the asymptotic worst case.

Another significant bit is testing. Please extend the test suite and test your implementation. Things such as the erase_base_features error would not have survived testing.

nicolasdugue commented 8 months ago

Hi @LucaCappelletti94, Thanks for your comment, I am glad you answered so fast !

I am sorry for the workers and erase_base_features mistakes !

Regarding your remarks/questions :

I tried to add some insights about the gamma parameter. Is it enough ?
I extended the test suite and SINr class now survives testing :)
Complexity of the algorithm is quasi-linear. It runs actually quite fast, and can scale easily to huge graphs. The only requirement is to be able to store the adjacency matrix and community membership matrix in ram.

I was unclear sorry, it is not included in networkx. But the implementation I provided here is based on networkx louvain community detection. In our package, we mostly rely on networkit.

Do you need some more information/tests ?

codecov-commenter commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.44%. Comparing base (5e1fe9c) to head (e6064c1).

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #156 +/- ## ========================================== + Coverage 97.41% 97.44% +0.03% ========================================== Files 63 64 +1 Lines 2712 2744 +32 ========================================== + Hits 2642 2674 +32 Misses 70 70 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

LucaCappelletti94 commented 8 months ago

Everything seems okay, I see there are a few typos in the comments. If you have the time to fix them I would appreciate it, otherwise I will do it when I get the time.

nicolasdugue commented 8 months ago

Hi @LucaCappelletti94, I tried to fix the typos in the comments, but my english is not great. I did my best but it may not be enough, I am sorry :( Thank you for your help, I am glad that SINr will be included in the package.

nicolasdugue commented 8 months ago

Hi, I asked someone to proofread me. Comments should now be okay :)

LucaCappelletti94 commented 8 months ago

Thanks, I appreciate it. Later today (or at the latest tomorrow) I will go through the whole thing and I look forward to merge the whole thing.

benedekrozemberczki / karateclub

SINr implementation to compute node embeddings from communities #156

Codecov Report