FluxML / GeometricFlux.jl

Geometric Deep Learning for Flux
https://fluxml.ai/GeometricFlux.jl/stable/
MIT License

revisit FeaturedGraph implementation #215

Closed. CarloLucibello closed this 3 years ago.

CarloLucibello commented 3 years ago

This is the part of #204 that concerns only the FeaturedGraph implementation. With respect to #204, and similarly to the current design, the implementation in this PR allows for different underlying graph representations. In particular we have:

I looked a bit into CuSparseMatrixCSC, which is what you get when you do

julia> using CUDA, SparseArrays

julia> x = sprand(3,3, 0.5) |> cu
3×3 CUDA.CUSPARSE.CuSparseMatrixCSC{Float32} with 4 stored entries:
  ⋅        0.206154  0.759705
 0.255161   ⋅         ⋅ 
  ⋅         ⋅        0.466779

Unfortunately its array interface is very limited, e.g. broadcasting is not supported, so I fear we won't be able to rely on it for a long time. On the other hand, CPU support is already very good with this PR: in the tests I get a single error, when trying to compute the eigenvalues of the Laplacian, which I should be able to fix in this PR.
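To make the limitation concrete, here is a minimal sketch (behavior as of the CUDA.jl version of the time; the broadcast line is commented out because it errors):

using CUDA, SparseArrays

x = sprand(Float32, 3, 3, 0.5) |> cu    # CuSparseMatrixCSC{Float32}
y = x * CUDA.rand(Float32, 3)           # matrix-vector products go through CUSPARSE and work
# z = x .+ 1f0                          # but a plain broadcast is not implemented and errors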

Two important differences with respect to the current FeaturedGraph:

After this PR gets in, we should do a careful performance analysis (see #214) and define specialized methods selecting the most performant path:

(gnn::GNNLayer)(fg::FeaturedGraph{<:COO_T}, X) = ....
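For concreteness, here is a minimal self-contained sketch of this dispatch pattern; FeaturedGraph, COO_T, ADJMAT_T and Layer are simplified stand-ins for illustration, not the PR's actual definitions:

using SparseArrays

const COO_T    = Tuple{Vector{Int}, Vector{Int}}   # (source, target) edge lists
const ADJMAT_T = AbstractMatrix

struct FeaturedGraph{G}
    graph::G
    num_nodes::Int
end

struct Layer end

# COO path: scatter-style accumulation over the edge list
function (l::Layer)(fg::FeaturedGraph{<:COO_T}, X::AbstractMatrix)
    s, t = fg.graph
    Y = zeros(eltype(X), size(X, 1), fg.num_nodes)
    for (i, j) in zip(s, t)
        @views Y[:, j] .+= X[:, i]   # aggregate features along edge i -> j
    end
    return Y
end

# adjacency-matrix path: a single algebraic product
(l::Layer)(fg::FeaturedGraph{<:ADJMAT_T}, X::AbstractMatrix) = X * fg.graph

s, t = [1, 2, 3], [2, 3, 1]
X = rand(2, 3)                                      # features × nodes
fg_coo = FeaturedGraph((s, t), 3)
fg_adj = FeaturedGraph(Matrix(sparse(s, t, 1.0, 3, 3)), 3)
Layer()(fg_coo, X) ≈ Layer()(fg_adj, X)             # true: same result, two code paths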

A byproduct of this PR is that it brings back GPU support for some graph/layer combinations (a partial fix for #195).

This is the output of a performance script similar to the one in #214, run on this PR. I added an extra column, time_fg, timing FeaturedGraph creation.

julia> df = run_benchmarks();

julia> filter(r -> r.layer == "GraphConv", df)
12×7 DataFrame
 Row │ n       layer      gtype    time_fg                    time_cpu                   time_gpu  gpu_to_cpu  
     │ Int64?  String?    Symbol?  Any                        Any                        Any       TrialRati…? 
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     10  GraphConv  coo      TrialEstimate(2.074 μs)    TrialEstimate(564.894 μs)  missing   missing     
   2 │     10  GraphConv  dense    TrialEstimate(2.519 μs)    TrialEstimate(461.274 μs)  missing   missing     
   3 │     10  GraphConv  sparse   TrialEstimate(3.228 μs)    TrialEstimate(502.207 μs)  missing   missing     
   4 │    100  GraphConv  coo      TrialEstimate(14.347 μs)   TrialEstimate(6.341 ms)    missing   missing     
   5 │    100  GraphConv  dense    TrialEstimate(27.635 μs)   TrialEstimate(6.191 ms)    missing   missing     
   6 │    100  GraphConv  sparse   TrialEstimate(17.743 μs)   TrialEstimate(6.608 ms)    missing   missing     
   7 │   1000  GraphConv  coo      TrialEstimate(132.999 μs)  TrialEstimate(271.801 ms)  missing   missing     
   8 │   1000  GraphConv  dense    TrialEstimate(2.102 ms)    TrialEstimate(248.011 ms)  missing   missing     
   9 │   1000  GraphConv  sparse   TrialEstimate(162.581 μs)  TrialEstimate(264.949 ms)  missing   missing     
  10 │  10000  GraphConv  coo      TrialEstimate(1.380 ms)    TrialEstimate(2.340 s)     missing   missing     
  11 │  10000  GraphConv  dense    TrialEstimate(365.196 ms)  TrialEstimate(2.657 s)     missing   missing     
  12 │  10000  GraphConv  sparse   TrialEstimate(1.504 ms)    TrialEstimate(3.161 s)     missing   missing     

julia> filter(r -> r.layer == "GCNConv", df)
12×7 DataFrame
 Row │ n       layer    gtype    time_fg                    time_cpu                   time_gpu                   gpu_to_cpu           
     │ Int64?  String?  Symbol?  Any                        Any                        Any                        TrialRati…?          
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     10  GCNConv  coo      TrialEstimate(1.966 μs)    TrialEstimate(23.860 μs)   missing                    missing              
   2 │     10  GCNConv  dense    TrialEstimate(2.474 μs)    TrialEstimate(22.632 μs)   TrialEstimate(262.075 μs)  TrialRatio(1157.98%)
   3 │     10  GCNConv  sparse   TrialEstimate(3.725 μs)    TrialEstimate(24.894 μs)   missing                    missing              
   4 │    100  GCNConv  coo      TrialEstimate(14.183 μs)   TrialEstimate(144.782 μs)  missing                    missing              
   5 │    100  GCNConv  dense    TrialEstimate(27.814 μs)   TrialEstimate(223.214 μs)  TrialEstimate(269.616 μs)  TrialRatio(120.79%)
   6 │    100  GCNConv  sparse   TrialEstimate(17.480 μs)   TrialEstimate(143.833 μs)  missing                    missing              
   7 │   1000  GCNConv  coo      TrialEstimate(133.200 μs)  TrialEstimate(1.255 ms)    missing                    missing              
   8 │   1000  GCNConv  dense    TrialEstimate(2.142 ms)    TrialEstimate(10.964 ms)   TrialEstimate(1.578 ms)    TrialRatio(14.39%)
   9 │   1000  GCNConv  sparse   TrialEstimate(157.977 μs)  TrialEstimate(1.167 ms)    missing                    missing              
  10 │  10000  GCNConv  coo      TrialEstimate(1.386 ms)    TrialEstimate(12.814 ms)   missing                    missing              
  11 │  10000  GCNConv  dense    TrialEstimate(389.277 ms)  TrialEstimate(1.415 s)     TrialEstimate(145.412 ms)  TrialRatio(10.28%)
  12 │  10000  GCNConv  sparse   TrialEstimate(1.527 ms)    TrialEstimate(13.274 ms)   missing                    missing          

I hope this can address some of the concerns expressed by @yuehhua in #204.

Fix #200, Fix #217, Fix https://github.com/yuehhua/GraphSignals.jl/issues/50

CarloLucibello commented 3 years ago

Now adjacency_matrix(fg::FeaturedGraph{<:COO_T}) returns a sparse matrix, and performance with GCNConv is on par with graph_type=:sparse (and up to 100x faster than :dense).
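What this amounts to is roughly the following (a hypothetical stand-alone helper, not the PR's actual method, which dispatches on the FeaturedGraph type):

using SparseArrays

# Turn (source, target) edge vectors into a SparseMatrixCSC adjacency matrix,
# so that GCNConv's algebraic path multiplies sparse instead of dense arrays.
coo_to_sparse(s::Vector{Int}, t::Vector{Int}, n::Int) =
    sparse(s, t, ones(Float32, length(s)), n, n)

A = coo_to_sparse([1, 2, 3], [2, 3, 1], 3)   # 3×3 SparseMatrixCSC{Float32, Int64}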

@yuehhua this is ready for review

CarloLucibello commented 3 years ago

Bump. If we get this in, I can proceed with the other changes (a lot of bug fixes and huge performance improvements).

CarloLucibello commented 3 years ago

@yuehhua I think this is complete. CUDA sparse support is still not very good; hopefully it can be cleaned up a bit in future PRs here and in CUDA.jl (e.g. https://github.com/JuliaGPU/CUDA.jl/pull/1093).

yuehhua commented 3 years ago

Could you please move the FeaturedGraph part to GraphSignals? Thanks.

CarloLucibello commented 3 years ago

The changes to FeaturedGraph are basically the whole PR! This work has been quite time-consuming already, and I fear that applying these and a few future changes while working across 3 packages would be a nightmare. Can we temporarily drop the dependence on GraphSignals until things settle down?

yuehhua commented 3 years ago

I benchmarked on my side. I list the benchmark scores for GCNConv and GraphConv as representative examples of spectral-based and spatial-based graph convolution, respectively. I also benchmarked the newly designed SparseGraph structure from yuehhua/GraphSignals.jl#54. I was not able to benchmark the GPU part.

julia> df[df.layer .== "GCNConv", :]
20×8 DataFrame
 Row │ N       c         layer    gtype        time_fg                    time_cpu                   time_gpu  gpu_to_cpu 
     │ Int64?  Float64?  String?  Symbol?      Any                        Any                        Any       Float64?   
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     10       6.0  GCNConv  adjlist      TrialEstimate(401.628 ns)  TrialEstimate(33.625 μs)   missing         Inf
   2 │     10       6.0  GCNConv  dense        TrialEstimate(1.475 μs)    TrialEstimate(34.917 μs)   missing         Inf
   3 │     10       6.0  GCNConv  lightgraph   TrialEstimate(371.961 ns)  TrialEstimate(32.292 μs)   missing         Inf
   4 │     10       6.0  GCNConv  sparse       TrialEstimate(442.078 ns)  TrialEstimate(30.709 μs)   missing         Inf
   5 │     10       6.0  GCNConv  sparsegraph  TrialEstimate(8.382 μs)    TrialEstimate(31.001 μs)   missing         Inf
   6 │    100       6.0  GCNConv  adjlist      TrialEstimate(2.533 μs)    TrialEstimate(112.584 μs)  missing         Inf
   7 │    100       6.0  GCNConv  dense        TrialEstimate(32.333 μs)   TrialEstimate(111.064 μs)  missing         Inf
   8 │    100       6.0  GCNConv  lightgraph   TrialEstimate(4.181 μs)    TrialEstimate(94.834 μs)   missing         Inf
   9 │    100       6.0  GCNConv  sparse       TrialEstimate(4.148 μs)    TrialEstimate(108.668 μs)  missing         Inf
  10 │    100       6.0  GCNConv  sparsegraph  TrialEstimate(80.376 μs)   TrialEstimate(105.001 μs)  missing         Inf
  11 │   1000       6.0  GCNConv  adjlist      TrialEstimate(26.625 μs)   TrialEstimate(760.026 μs)  missing         Inf
  12 │   1000       6.0  GCNConv  dense        TrialEstimate(1.197 ms)    TrialEstimate(932.298 μs)  missing         Inf
  13 │   1000       6.0  GCNConv  lightgraph   TrialEstimate(54.167 μs)   TrialEstimate(729.047 μs)  missing         Inf
  14 │   1000       6.0  GCNConv  sparse       TrialEstimate(39.834 μs)   TrialEstimate(938.006 μs)  missing         Inf
  15 │   1000       6.0  GCNConv  sparsegraph  TrialEstimate(825.986 μs)  TrialEstimate(887.924 μs)  missing         Inf
  16 │  10000       6.0  GCNConv  adjlist      TrialEstimate(146.918 μs)  TrialEstimate(8.673 ms)    missing         Inf
  17 │  10000       6.0  GCNConv  dense        TrialEstimate(122.352 ms)  TrialEstimate(8.651 ms)    missing         Inf
  18 │  10000       6.0  GCNConv  lightgraph   TrialEstimate(587.150 μs)  TrialEstimate(7.807 ms)    missing         Inf
  19 │  10000       6.0  GCNConv  sparse       TrialEstimate(333.669 μs)  TrialEstimate(10.331 ms)   missing         Inf
  20 │  10000       6.0  GCNConv  sparsegraph  TrialEstimate(7.334 ms)    TrialEstimate(7.313 ms)    missing         Inf

For the GCNConv task, all graph types have almost the same CPU time on small graphs, with the sparse array and SparseGraph taking first and second place. On large graphs, SparseGraph has the lowest CPU time and the sparse array the highest.

julia> df[df.layer .== "GraphConv", :]
20×8 DataFrame
 Row │ N       c         layer      gtype        time_fg                    time_cpu                   time_gpu  gpu_to_cpu 
     │ Int64?  Float64?  String?    Symbol?      Any                        Any                        Any       Float64?   
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │     10       6.0  GraphConv  adjlist      TrialEstimate(395.317 ns)  TrialEstimate(255.794 μs)  missing         Inf
   2 │     10       6.0  GraphConv  dense        TrialEstimate(1.444 μs)    TrialEstimate(263.335 μs)  missing         Inf
   3 │     10       6.0  GraphConv  lightgraph   TrialEstimate(369.883 ns)  missing                    missing     missing  
   4 │     10       6.0  GraphConv  sparse       TrialEstimate(417.866 ns)  TrialEstimate(210.189 μs)  missing         Inf
   5 │     10       6.0  GraphConv  sparsegraph  TrialEstimate(8.424 μs)    TrialEstimate(320.670 μs)  missing         Inf
   6 │    100       6.0  GraphConv  adjlist      TrialEstimate(2.900 μs)    TrialEstimate(4.173 ms)    missing         Inf
   7 │    100       6.0  GraphConv  dense        TrialEstimate(34.209 μs)   TrialEstimate(4.137 ms)    missing         Inf
   8 │    100       6.0  GraphConv  lightgraph   TrialEstimate(3.852 μs)    missing                    missing     missing  
   9 │    100       6.0  GraphConv  sparse       TrialEstimate(4.111 μs)    TrialEstimate(5.943 ms)    missing         Inf
  10 │    100       6.0  GraphConv  sparsegraph  TrialEstimate(80.293 μs)   TrialEstimate(4.488 ms)    missing         Inf
  11 │   1000       6.0  GraphConv  adjlist      TrialEstimate(27.250 μs)   TrialEstimate(214.105 ms)  missing         Inf
  12 │   1000       6.0  GraphConv  dense        TrialEstimate(1.197 ms)    TrialEstimate(199.505 ms)  missing         Inf
  13 │   1000       6.0  GraphConv  lightgraph   TrialEstimate(55.500 μs)   missing                    missing     missing  
  14 │   1000       6.0  GraphConv  sparse       TrialEstimate(38.000 μs)   TrialEstimate(281.798 ms)  missing         Inf
  15 │   1000       6.0  GraphConv  sparsegraph  TrialEstimate(821.590 μs)  TrialEstimate(209.873 ms)  missing         Inf
  16 │  10000       6.0  GraphConv  adjlist      TrialEstimate(151.335 μs)  missing                    missing     missing  
  17 │  10000       6.0  GraphConv  dense        TrialEstimate(122.266 ms)  missing                    missing     missing  
  18 │  10000       6.0  GraphConv  lightgraph   TrialEstimate(594.670 μs)  missing                    missing     missing  
  19 │  10000       6.0  GraphConv  sparse       TrialEstimate(335.211 μs)  missing                    missing     missing  
  20 │  10000       6.0  GraphConv  sparsegraph  TrialEstimate(7.461 ms)    missing                    missing     missing  

For the GraphConv task, there are some isolated nodes in the N=10000 graph, so GraphConv cannot run on it. I think it is fair enough to judge performance with N from 10 to 1000. Surprisingly, the dense array has the best CPU time for large graphs, with SparseGraph in second place.

Sadly, the COO format is not in my benchmark script, so we can only compare the top scores above with the scores here. In the GCNConv task, the COO format has slightly lower CPU time than the sparse array for large graphs, and the sparse array has the longest CPU time. As for the GraphConv task, we have N=1000 on both sides, so we compare scores based on the N=1000 case. The COO format has the longest CPU time (271.801 ms); even the dense and sparse arrays perform better. SparseGraph has sub-optimal performance (209.873 ms), a bit longer than dense (199.505 ms) but shorter than the sparse array (281.798 ms). Thus we can conclude that SparseGraph performs better than the COO format.

In conclusion, the best performers in the GCNConv task and the GraphConv task are SparseGraph and the dense array, respectively. In addition, SparseGraph takes second place in the GraphConv task, while the dense array comes in third-to-last in the GCNConv task. On average, SparseGraph has the potential to outperform the others in both spectral-based and spatial-based graph convolution tasks.

I have decided to take the new SparseGraph design as the core of FeaturedGraph.

yuehhua commented 3 years ago

Thank you for the huge effort. It helped decide the core of FeaturedGraph. I have merged yuehhua/GraphSignals.jl#54.

CarloLucibello commented 3 years ago

I'm really disheartened by this decision, for several reasons:

So any decision stemming from performance or other considerations should be taken once this PR and the rest of #204 are merged. I kindly ask you to reconsider and merge this PR.

yuehhua commented 3 years ago

this PR decouples the interface from the internal representation, so any change from yuehhua/GraphSignals.jl#54 could be integrated without any breakage

I understand that the integration can be made easily, but decoupling the internal representation introduces complexity. I am confused about the pursuit of simplicity and maintainability: these two conflict. So, is the decoupling here really what we want? FeaturedGraph and the internal graph representation provide highly similar functionality. One should consider that they have high cohesion, instead of decoupling them. Am I right?

This PR + other changes from #204 would lead to 10x - 100x speedup in message passing layers on cpu.

I showed that SparseGraph has even better performance than the COO format in both message passing layers and GCNConv layers on CPU. I would expect SparseGraph to show a speedup in message passing layers on CPU as well.

This PR + other changes from #204 would allow for GPU support. yuehhua/GraphSignals.jl#54 doesn't give us GPU support.

That is a good point. GPU support can also be added in follow-up PRs.

CarloLucibello commented 3 years ago

I understand that the integration can be made easily, but decoupling the internal representation introduces complexity. I am confused about the pursuit of simplicity and maintainability: these two conflict. So, is the decoupling here really what we want? FeaturedGraph and the internal graph representation provide highly similar functionality. One should consider that they have high cohesion, instead of decoupling them. Am I right?

I don't think I fully understand this comment. What I wanted to say is this: previously, when constructing a FeaturedGraph(g) from some graph format g, that format became the inner implementation, so we were completely giving up control and allowing a variety of inefficient implementations. With this PR we take control back: the implementation is selected through the keyword graph_type, among a limited set for which we can guarantee good support. In this PR the allowed graph types are :coo, :sparse and :dense. We could also drop :dense, since it is inefficient for large graphs.
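On this PR's branch the selection looks roughly like this (LightGraphs was the graph library of the time; the exact constructor signature is as described in this comment):

using GeometricFlux, LightGraphs

g = erdos_renyi(100, 0.1)                   # any supported input graph format
fg = FeaturedGraph(g; graph_type=:coo)      # edge-list (COO) backend
fg = FeaturedGraph(g; graph_type=:sparse)   # SparseMatrixCSC backend
fg = FeaturedGraph(g; graph_type=:dense)    # dense adjacency backend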

As this PR shows, supporting 2 or 3 graph types can be done simply: the code here is much more concise, maintainable and simple than the code in GraphSignals.jl and GraphLaplacians.jl, which this PR replaces entirely. We could also reduce the supported types to a single one in the future, if we manage to have a single implementation with good performance on all layers and good CUDA support.

The changes in yuehhua/GraphSignals.jl#54 instead, where a FeaturedGraph contains a SparseGraph that contains a SparseMatrixCSC, seem to me to be going in the opposite direction from simplicity and maintainability.

I showed that SparseGraph has even better performance than the COO format in both message passing layers and GCNConv layers on CPU. I would expect SparseGraph to show a speedup in message passing layers on CPU as well.

I don't understand how you compared the performance of the COO implementation without running the benchmark script on this PR. In any case, it is rather pointless to compare performance right now, because the current message passing is highly inefficient: I can obtain 10x and more speedup! We can compare performance in a meaningful way after revisiting the message passing scheme.

That is a good point. GPU support can also be added in follow-up PRs.

Not really, not in the near future. yuehhua/GraphSignals.jl#54 was made without GPU considerations in mind, and CUDA.jl support for sparse matrices is still at an early stage. The only GPU implementation we can provide right now is through the COO format; it is already there and working very well in ~#200~ (Edit: sorry, I meant #204). It would really be a waste to give that up. How many months will we have to wait for GPU support? Is it really worth it, when we can have it right now and it has been ready for a month?

My point is: we can have a 10x and more speedup in CPU message passing, CUDA support, and the possibility of switching to any other backend in the future, without having to tag more breaking releases. It's a lot of good stuff and the work is already done; do you really want to throw all that away? Let's collect these goodies and move forward!

yuehhua commented 3 years ago

As this PR shows, supporting 2 or 3 graph types can be done simply: the code here is much more concise, maintainable and simple than the code in GraphSignals.jl and GraphLaplacians.jl, which this PR replaces entirely. We could also reduce the supported types to a single one in the future, if we manage to have a single implementation with good performance on all layers and good CUDA support.

I think the key point here is this sentence. First, you reduce the set of possible inner implementations for FeaturedGraph, so that we don't have to support so many graph types. I agree with you on this point. So, ultimately, why don't we just support only one inner implementation for FeaturedGraph? As you mentioned here:

We could also reduce the supported types to a single one in the future

So, the changes in yuehhua/GraphSignals.jl#54 are just a step towards a single graph type, in the same direction of simplicity and maintainability. SparseGraph supports both algebraic operations and indexing operations. The underlying SparseMatrixCSC is itself a matrix, so it naturally supports algebraic operations, including +, -, *, / etc. It stores indices and values in CSC format, which fits the need of indexing the neighbors of a vertex in a graph.
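As a small illustration of this last point, with plain SparseArrays (not GraphSignals' actual code):

using SparseArrays

# In CSC storage the nonzero row indices of column v are a contiguous slice,
# so the neighbors stored in column v can be read off without any search:
A = sparse([1, 2, 3], [2, 3, 1], trues(3), 3, 3)   # A[i, j] = true ⇔ edge i → j

col_neighbors(A::SparseMatrixCSC, v::Integer) = rowvals(A)[nzrange(A, v)]

col_neighbors(A, 2)   # sources of edges into vertex 2 → [1]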

SparseGraph has such good properties, so why do we have to consider other graph types? We can both see that the goal is to support a single graph type which performs excellently for both algebraic and indexing operations. Fine. We just benchmark all the graph types, decide which one is best for both operations, and make that graph type the inner implementation.

Thus, we move to the next stage: deciding which graph type has the best performance.

I don't understand how you compared the performance of the COO implementation without running the benchmark script on this PR.

Since the :coo option doesn't exist in the current benchmark script, I cannot benchmark the COO implementation directly. Indirectly, I compared the benchmarks for the existing :dense and :sparse types and combined the performance rankings from your benchmark result (n=1000 CPU time: :dense < :sparse < :coo) and my benchmark result (n=1000 CPU time: :dense < :sparsegraph < :sparse). We can draw the conclusion: n=1000 CPU time: :dense < :sparsegraph < :sparse < :coo.

I can obtain 10x and more speedup!

Just show me how you benchmarked this, and which one you compared to which.

Not really, not in the near future. yuehhua/GraphSignals.jl#54 was made without GPU considerations in mind.

How can you tell that there were no GPU considerations in mind?

The only GPU implementation we can provide right now is through the COO format; it is already there and working very well in #200.

That is not true. Currently, CUSPARSE supports many formats, including CuSparseMatrixCSC, CuSparseMatrixCSR, CuSparseMatrixBSR and CuSparseMatrixCOO. However, in your implementation the COO format is just three CuVectors. What is the difference from using CuSparseMatrixCOO? Furthermore, if you want to support more functionality, you cannot get any advantage from CUSPARSE or its existing interfaces; instead, you have to implement everything again.

it is already there and working very well in #200.

I am confused about why you reference an issue. An issue doesn't provide any implementation, let alone one that works well.

CarloLucibello commented 3 years ago

So, ultimately, why don't we just support only one inner implementation for FeaturedGraph?

This is what I did in #204, based on the COO implementation, but you strongly opposed it. Then I created this PR so that different implementations could be supported and compared. I have said it many times already: the only GPU-compliant implementation we can have right now is COO; this is why we should either switch entirely to that or allow a few different backends.

Thus, we move to the next stage: deciding which graph type has the best performance.

In order to compare, we need a PR like this one that allows for different backends; otherwise every comparison will require a major rewrite. With yuehhua/GraphSignals.jl#54, making comparisons becomes incredibly complicated.

Just show me how you benchmarked this, and which one you compared to which.

In order to benchmark this, we need the other changes from #204. I have another branch with the changes to be made on top of this PR, but since this PR is not getting merged I don't know what to do with them. You can try to check out #204 and adapt the benchmark script to that PR. I'm very frustrated by the fact that you chose to do yuehhua/GraphSignals.jl#54 not months ago, not in a few months, but just now, nullifying a lot of my work. This is not the kind of thing that makes one feel welcome to contribute to an open source project.

How can you tell that there were no GPU considerations in mind?

It contains a lot of scalar indexing. Try adding GPU tests for each of the methods, even basic ones such as ne.
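For context, this is how such code paths surface in GPU tests (generic CUDA.jl behavior, not code from either package):

using CUDA

CUDA.allowscalar(false)   # the default in test suites: forbid elementwise fallbacks
x = CUDA.rand(10)

sum(x)    # fine: runs as a single GPU reduction
# x[1]    # would throw a scalar-indexing error, revealing a CPU-style code path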

Currently, CUSPARSE supports many formats, including CuSparseMatrixCSC, CuSparseMatrixCSR, CuSparseMatrixBSR and CuSparseMatrixCOO. However, in your implementation the COO format is just three CuVectors. What is the difference from using CuSparseMatrixCOO?

As you can read for yourself at https://github.com/JuliaGPU/CUDA.jl/blob/fb7440fb568607de4318df4b2e26ad3bd73cf851/lib/cusparse/array.jl#L115, CuSparseMatrixCOO is scarcely supported (though the situation is not much better with the other CuSparse formats). Anyway, if we have this PR we can implement whatever backend we want; graph_type=:coo doesn't necessarily have to mean 3 vectors.

I am confused about why you reference an issue. An issue doesn't provide any implementation, let alone one that works well.

Sorry, I meant #204.

yuehhua commented 3 years ago

I am sorry to have made you feel frustrated about contributing to a project. It began with #204, which contained too much redesign and was hard to review. Most importantly, the redesign broke the backbone of this project, and I had already hinted that you should check the paper first, but that didn't work. Secondly, we discussed new designs in #200, #201 and #202, and I agreed with some points, but not all. But you insisted on going your own way. I was forced to review code containing parts I disagreed with, and it did cost my time. Obviously, we didn't reach a consensus. However, I am still glad to have had these discussions, which clarified our thinking and helped us make some decisions. Not every PR gets merged, and I have given my reasons as well. These issues and PRs are not worthless. I am sorry for not accepting your PR. I think I will end here.

CarloLucibello commented 3 years ago

It's fine to have diverging opinions, and since you are the author of the package, the ultimate responsibility for steering its direction is yours. I feel there are some major design issues in the current situation that I wasn't given the opportunity to address. My interest is for the Julia ecosystem to have a simple and performant graph neural network library. Since, in my opinion, GeometricFlux.jl doesn't meet these requirements and isn't willing to, I will consider forking it or creating a new GNN package. That said, even though we have different design opinions, I highly appreciate the work you have done here. Thanks for all the reviews and discussions, Carlo