MKLab-ITI / pygrank

Recommendation algorithms for large graphs
Apache License 2.0

Performant use of sparse matrices? #16

Closed deklanw closed 1 year ago

deklanw commented 1 year ago

I'm trying to use pygrank with larger graphs: 100k-1m nodes, hundreds of millions of edges. My graphs are in sparse matrix format. So far I've just converted them to networkx and used that:

g = nx.from_scipy_sparse_array(A, create_using=nx.DiGraph)

signal_dict = {i: 1.0 for i in seeds}
signal = pg.to_signal(g, signal_dict)

# normalize signal
signal.np /= signal.np.sum()

result = algorithm(signal).np

Is there a more performant option available?

maniospas commented 1 year ago

This is indeed very inefficient. There is currently no direct solution in the code base, because the library can get confused when it does not know which backend produced the sparse matrix (e.g. whether it is a tensor). We may add functionality to address this in the future (this issue will be left open as a reminder until a decision is made), but for the time being you can use either of these workarounds:

Solution 1: You can write a wrapper class around the adjacency matrix that implements the most crucial operations of the pygrank.fastgraph interface. This does not support some of the more exotic components, but will work fine in most cases. You can do it like this:

import pygrank as pg

class AdjacencyWrapper(pg.Graph):  # it's necessary to inherit pg.Graph to trick the library's internal type checking
    def __init__(self, adj):
        self.adj = adj
        self.num_nodes = adj.shape[0]

    def is_directed(self):
        return True  # return False if you know that adj is symmetric - this changes the default type of normalization applied on the graph

    def __iter__(self):
        return range(self.num_nodes).__iter__()

    def __len__(self):
        return self.num_nodes

    def to_scipy_sparse_array(self):
        return self.adj

g = AdjacencyWrapper(A)
...

Note that, if you have normalized A beforehand, you also need to pass the argument normalization="none" to your algorithm's graph filter constructor. I tested this on the development code, which is a little ahead of the latest released version, and it runs correctly, but let me know if you encounter any issues.
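For reference, a minimal sketch of how the wrapper could be combined with a graph filter (pg.PageRank is only a placeholder for whatever algorithm you run, and A and seeds come from your own code; drop normalization="none" if A has not been normalized beforehand):

import pygrank as pg

g = AdjacencyWrapper(A)                        # A: scipy sparse adjacency matrix
algorithm = pg.PageRank(normalization="none")  # only if A was already normalized
signal = pg.to_signal(g, {i: 1.0 for i in seeds})
scores = algorithm(signal).np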

Solution 2: If your code allows it, I strongly recommend loading the data into a graph with something like the following (use networkx instead if you need to load a weighted graph):

G = pg.Graph(directed=False)
groups = {}
with open(filename, 'r', encoding='utf-8') as file:
    for line in file:
        if len(line) != 0 and line[0] != '#':  # ignore comment lines or zero length lines
            splt = line[:-1].split()
            if len(splt) > 1:
                G.add_edge(splt[0], splt[1])  # just add the edge

and then using the library to produce a normalized sparse adjacency matrix from the graph with

A = pg.preprocessor()(G)

This will be a scipy sparse array if you are on the numpy backend (the default backend), and you can use it in the rest of your code. Note, though, that its values will be normalized. The difference compared to using an arbitrary adjacency matrix is that this one has been produced by the library's preprocessor, which adds some hidden fields to the array object that let it be used like a graph. For example, you can then call pg.to_signal(A, signal_dict).
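Putting Solution 2 together, a minimal sketch could look like this (pg.PageRank and signal_dict are again just placeholders for your own algorithm and seed dictionary):

A = pg.preprocessor()(G)               # normalized scipy sparse array on the numpy backend
signal = pg.to_signal(A, signal_dict)  # the preprocessed matrix can stand in for the graph
scores = pg.PageRank()(signal).np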

P.S. Input signal normalization is not needed - the library does this internally for computational stability anyway. If you want the outcomes to sum to one, add a pg.Normalize("sum") postprocessor to your algorithm (e.g. algorithm = algorithm >> pg.Normalize("sum")), which is the only way to guarantee this property for any algorithm. By the way, you can also write signal = signal/pg.sum(signal) just fine.
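For example (pg.PageRank is again just a placeholder for your algorithm):

algorithm = pg.PageRank() >> pg.Normalize("sum")  # output scores are rescaled to sum to one
scores = algorithm(signal)
# or, if you still want to normalize the input signal yourself:
signal = signal / pg.sum(signal)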

deklanw commented 1 year ago

Thank you! This is helpful.

I think iterating over hundreds of millions of edges would be too slow for Solution 2. For Solution 1, which exotic operations would be broken?

maniospas commented 1 year ago

I'm mentioning Solution 2 because parsing with pg.Graph is considerably faster than parsing into a sparse matrix and then calling nx.from_scipy_sparse_array, which is what you seem to be doing. But I agree that it should be slower than the direct conversion of the first solution. Maybe this needs an experiment to tell for sure.

The first solution will throw an exception if something is missing, so things will not silently work incorrectly without notification. I think it only fails to work with Modularity (this needs a ton of helper methods to compute), LinkAssessment (this needs to be able to retrieve lists of node neighbors), the "top" mode of SeedOversampling (not the default one), and the remove_intra_edges helper method (this is useful when conducting link prediction experiments but is never called internally).

deklanw commented 1 year ago

Ah, that makes sense. In my experiments I already have saved sparse matrices. But if I do build them from scratch I will keep that in mind. Thanks.

Thankfully I'm not using any of those functions currently, but it's good to know that I'll get an exception if I do. Thanks!

maniospas commented 1 year ago

The scipy matrix wrapper class described above has been integrated in version 0.2.10. Instructions have been added to the FAQ for big data analysis - this will be expanded in the future.