CarloLucibello / GraphNeuralNetworks.jl

Graph Neural Networks in Julia
https://carlolucibello.github.io/GraphNeuralNetworks.jl/dev/
MIT License

Add a transformer-like convolutional layer #249

Closed graidl closed 1 year ago

graidl commented 1 year ago

I suggest adding a transformer-like convolutional layer. The proposed implementation follows TransformerConv from PyTorch Geometric but also provides a few additional options. More specifically, it realizes the transformer-like multi-head attention convolutional operator from the Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification paper, with optional edge features. Additionally, there are options to configure it as the transformer-like convolutional operator from the Attention, Learn to Solve Routing Problems! paper, including a subsequent feed-forward network as well as skip connections and batch normalization.
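For concreteness, here is a rough usage sketch of the proposed layer. The constructor form and keyword names below (heads, concat, add_self_loops, root_weight, skip_connection, batch_norm, ff_channels) are my reading of the options described above and may not match the final API exactly:

```julia
using GraphNeuralNetworks

g = rand_graph(10, 30)            # random graph: 10 nodes, 30 edges
x = rand(Float32, 16, 10)         # node features, 16 per node
e = rand(Float32, 8, 30)          # optional edge features, 8 per edge

# Shi et al. style: multi-head attention with edge features
l1 = TransformerConv((16, 8) => 16; heads = 2)
y1 = l1(g, x, e)                  # 2 heads × 16 channels, concatenated (assuming concat defaults to true)

# Kool et al. style: self-loops instead of the gated root term, plus
# skip connection, batch normalization, and a feed-forward block
l2 = TransformerConv(16 => 16; heads = 2, concat = false,
                     add_self_loops = true, root_weight = false,
                     skip_connection = true, batch_norm = true,
                     ff_channels = 64)
y2 = l2(g, x)                     # 16 × 10
```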

Carlo: edited with the correct link to the Attention, Learn to solve... paper

codecov[bot] commented 1 year ago

Codecov Report

Merging #249 (7aba1f2) into master (2020e58) will increase coverage by 0.55%. The diff coverage is 95.52%.

@@            Coverage Diff             @@
##           master     #249      +/-   ##
==========================================
+ Coverage   80.68%   81.23%   +0.55%     
==========================================
  Files          17       17              
  Lines        1734     1801      +67     
==========================================
+ Hits         1399     1463      +64     
- Misses        335      338       +3     
Impacted Files       Coverage Δ
src/layers/conv.jl   79.28% <95.52%> (+2.34%) ↑


graidl commented 1 year ago

Note that the Julia 1.6 CI run shows two failed tests, but they concern GATConv rather than TransformerConv, and this PR doesn't touch GATConv. I tried to reproduce these failures on my side, but they only seem to occur with small probability. Maybe the tests are not deterministic; would it be a good idea to set a constant random seed in runtests.jl and/or the individual test files?
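For illustration, a minimal sketch of the seeding idea, assuming the standard test entry point test/runtests.jl and an arbitrary seed value:

```julia
# At the top of test/runtests.jl (sketch): fix the global RNG so that the
# stochastic GATConv tests give the same result on every CI run.
using Random
Random.seed!(17)   # arbitrary constant seed
```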

CarloLucibello commented 1 year ago

> Maybe the tests are not deterministic

Yes, this is the case. Let's see if the recent changes to the numerical stability of GATConv have made them more robust.

CarloLucibello commented 1 year ago

I'm worried that the two papers use two different normalizations; maybe we should treat layers just as basic building blocks and try not to conflate too much.

So I suggest the following:

CarloLucibello commented 1 year ago

@graidl what do you think? I'm not sure what the right path is here.

graidl commented 1 year ago

Thanks for reviewing the proposed extension so quickly; your feedback is much appreciated, and you indeed raise important questions! Let me start with a little background so that you understand my motivation for some of the decisions made so far.

Originally, I started with the model from Kool et al. https://arxiv.org/abs/1803.08475. As in the famous paper on the classical Transformer architecture by Vaswani et al. https://arxiv.org/abs/1706.03762 (not related to graphs), Kool et al. apply batch normalization and skip connections in conjunction with the message passing as the central building block of the whole model. This building block appears in many works relating to transformers in general; I would even say that it is nowadays the standard ingredient added to the propagation.

Since our work also required edge features, I stumbled at some point upon the PyTorch Geometric implementation of TransformerConv, which relies on the newer paper by Shi et al. https://arxiv.org/pdf/2009.03509.pdf. The principles are very similar, yet there are several smaller differences here and there which move the approach a little away from the classical Transformer, for example in the normalization and the gating mechanism. Note that Kool et al. use self-loops and self-attention instead of the gating, which imho is clearly more in line with the classical Transformer architecture. Unfortunately, Shi et al., although newer, do not provide an in-depth experimental comparison to earlier Transformer-based GNNs, except GAT, and personally I am therefore not that convinced of the several smaller deviations they apply. A more serious comparison would be interesting, and we intend to do this in one of our works.

Ultimately, I ended up trying to combine aspects of the different approaches and borrowed a few ideas from PyTorch Geometric's TransformerConv implementation.
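To make that building block concrete, here is a rough Flux sketch of the skip-connection + batch-normalization pattern used by Kool et al. (my own illustration, with a plain Dense layer standing in for the multi-head attention step):

```julia
using Flux

dim  = 64
attn = Dense(dim => dim)                              # stand-in for multi-head attention
ffn  = Chain(Dense(dim => 4dim, relu), Dense(4dim => dim))

# Each sublayer is wrapped as h ← BN(h + sublayer(h)), as in the classical
# Transformer, but with batch normalization instead of layer normalization.
block = Chain(
    SkipConnection(attn, +), BatchNorm(dim),
    SkipConnection(ffn, +),  BatchNorm(dim),
)

h = rand(Float32, dim, 10)                            # 10 node embeddings of size dim
h_out = block(h)                                      # same shape: dim × 10
```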

Now to your specific suggestions:

Let me know how you prefer to proceed, and I will prepare a revision that takes all your remarks into account. Thanks!

CarloLucibello commented 1 year ago

I'm merging this PR as it is. Besides its utility per se, the layer is a very good template for implementing many transformer-like blocks. Thanks for the effort, an impeccable PR.