emartinezs44 / SparkGCN

GCN implementation on top of Apache Spark

[Question] What is the max size of the graph supported by this implementation? #5

Closed Ishitori closed 1 year ago

Ishitori commented 1 year ago

Hi Emiliano,

Thanks for the great project! I'd like to try it out, but before I start experimenting with it, I'm curious how big the graph can be. Can it be a few million nodes? What about a few hundred million?

Thanks.

emartinezs44 commented 1 year ago

This implementation was created with processing big graphs in mind, using Spark's capabilities without any additional hardware and pushing Intel CPU capabilities (MKL, or even MKL-DNN) to the limit.

Some things to take into account:

  1. The adjacency matrix is created in the driver using Breeze, because Breeze makes it easy to perform the operations needed to normalize the matrix. Depending on how dense this matrix is and on the memory available in the driver, you can create very big matrices. Note that the matrix is sparse and all operations are performed on this kind of matrix. You could even use Spark distributed matrices for this, but that is more complicated (not impossible). So depending on how many connections the nodes have, you can handle sparse matrices with millions of nodes using the CSCMatrix (a minimal Breeze sketch follows after point 2 below).
  2. The current implementation uses only one partition. If you look at the model, the GraphConvLayer applies the convolution to the whole graph. The result of this operation is a dense tensor, which can be a problem with only one partition, depending on the number of nodes and the memory of the executor:

The dimensions of the tensors that are involved in the convolution are: Normalized_Adjacency(nodes_number, nodes_number) x Input_Graph(batch=nodes_number, features_number) = Input_graph_with_convolution(nodes_number, features_number)

This last matrix is dense, so depending on the number of nodes and features you can handle more or fewer millions of nodes. Note that with a single partition the processing time also grows as your graph grows.
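
For reference, here is a minimal sketch of points 1 and 2: a hypothetical illustration with Breeze, not the repo's actual code, where the helper name, the edge-list input, and the use of `main` are mine. It builds the sparse adjacency with self-loops in the driver as a `CSCMatrix`, normalizes it symmetrically, and multiplies it by a dense feature matrix, which yields the dense `(nodes_number, features_number)` result described above.

```scala
import breeze.linalg.{CSCMatrix, DenseMatrix, DenseVector}

// Hypothetical sketch: sparse normalized adjacency A_hat = D^-1/2 (A + I) D^-1/2,
// built entirely in the driver, then multiplied by a dense feature matrix.
object AdjacencySketch {

  def normalizedAdjacency(numNodes: Int, edges: Seq[(Int, Int)]): CSCMatrix[Double] = {
    val builder = new CSCMatrix.Builder[Double](numNodes, numNodes)
    edges.foreach { case (i, j) =>
      builder.add(i, j, 1.0)
      builder.add(j, i, 1.0) // undirected graph: store both directions
    }
    (0 until numNodes).foreach(i => builder.add(i, i, 1.0)) // self-loops (the "+ I" term)
    val a = builder.result

    // Node degrees computed from the stored (sparse) entries only.
    val degree = DenseVector.zeros[Double](numNodes)
    a.activeIterator.foreach { case ((i, _), v) => degree(i) += v }
    val dInvSqrt = degree.map(d => if (d > 0.0) 1.0 / math.sqrt(d) else 0.0)

    // Rescale every stored entry: A_hat(i, j) = A(i, j) / sqrt(d_i * d_j).
    val normalized = new CSCMatrix.Builder[Double](numNodes, numNodes)
    a.activeIterator.foreach { case ((i, j), v) =>
      normalized.add(i, j, v * dInvSqrt(i) * dInvSqrt(j))
    }
    normalized.result
  }

  def main(args: Array[String]): Unit = {
    // Shapes from point 2: (N x N) sparse times (N x F) dense gives (N x F) dense.
    val aHat = normalizedAdjacency(numNodes = 4, edges = Seq((0, 1), (1, 2), (2, 3)))
    val features = DenseMatrix.rand[Double](4, 8)
    val convolved: DenseMatrix[Double] = aHat * features
    println(s"convolved shape: ${convolved.rows} x ${convolved.cols}") // 4 x 8
  }
}
```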

  3. There is one approach that I have tested for bigger graphs but haven't included in the repo yet:

I split the graph into components and place sets of these components in different partitions using a custom RDD partitioner. That allows the convolution and the forward pass to run across n executors, splitting the batch over different executors (dozens of them) to handle the convolution for much bigger graphs. Note that the weights of the neural network are synchronized every iteration.
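
A rough, hypothetical sketch of what such a partitioner could look like (not the unpublished implementation; the `ComponentPartitioner` name and the component-id keying are illustrative):

```scala
import org.apache.spark.Partitioner

// Hypothetical custom partitioner: records are keyed by a component id computed
// beforehand, so all nodes of the same graph component land in the same partition.
class ComponentPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val componentId = key.asInstanceOf[Long]
    // Non-negative modulo so every component maps to a valid partition index.
    ((componentId % numPartitions) + numPartitions).toInt % numPartitions
  }
}

// Usage sketch: nodesByComponent is an assumed RDD[(Long, NodeFeatures)] keyed by
// component id; each executor then convolves only the components it holds.
// val partitioned = nodesByComponent.partitionBy(new ComponentPartitioner(numExecutors))
```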

I had in mind publishing this approach; if you have a specific case, we can try it. That said, I would first check the current implementation with only one partition, and if you run into problems, we can work on it.

Ishitori commented 1 year ago
  1. Got it.
  2. Hm, this sounds like a scaling problem, since it depends on both the number of features and the number of nodes...
  3. Are you familiar with https://github.com/facebookresearch/PyTorch-BigGraph ? They do partitioning of a big graph into separate partitions and process them. Maybe you could borrow the idea from there?

emartinezs44 commented 1 year ago

I will include the distributed implementation. The idea can be similar to BigGraph. The point here is to use only Spark primitives and Spark graph-based frameworks.
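
As an illustration of staying within Spark primitives, here is a hedged sketch assuming GraphX as the graph framework (an assumption on my part; the object and helper names are mine and the repo may do this differently). It labels every vertex with its connected component and keys the result by component id, ready for a component-based partitioner like the one sketched above:

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.rdd.RDD

// Hypothetical helper: derive (componentId, vertexId) pairs using only GraphX.
object ComponentKeys {
  // `edges` is an assumed RDD of (srcId, dstId) vertex-id pairs.
  def componentKeys(edges: RDD[(Long, Long)]): RDD[(Long, Long)] = {
    val graph = Graph.fromEdgeTuples(edges, defaultValue = 0)
    // connectedComponents() labels each vertex with the smallest vertex id in its component.
    graph.connectedComponents().vertices.map { case (vertexId, componentId) =>
      (componentId, vertexId) // key by component id, ready for partitionBy
    }
  }
}
```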

Ishitori commented 1 year ago

Sounds good!