CarloLucibello / GraphNeuralNetworks.jl

Graph Neural Networks in Julia
https://carlolucibello.github.io/GraphNeuralNetworks.jl/dev/
MIT License

Add a transformer-like convolutional layer #249

Closed graidl closed 1 year ago

graidl commented 1 year ago

I suggest adding a transformer-like convolutional layer. The proposed implementation follows TransformerConv from PyTorch Geometric but also provides a few additional options. More specifically, it realizes the transformer-like multi-head attention convolutional operator from the Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification paper, with optional edge features. Additionally, there are options to configure it as the transformer-like convolutional operator from the Attention, Learn to Solve Routing Problems! paper, including a subsequent feed-forward network as well as skip connections and batch normalization.
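For concreteness, here is a rough usage sketch of the proposed layer. The constructor form and keyword names below (heads, concat, add_self_loops, root_weight, skip_connection, batch_norm, ff_channels) are my reading of the options described above and may not match the final API exactly:

```julia
using GraphNeuralNetworks

g = rand_graph(10, 30)            # random graph: 10 nodes, 30 edges
x = rand(Float32, 16, 10)         # node features, 16 per node
e = rand(Float32, 8, 30)          # optional edge features, 8 per edge

# Shi et al. style: multi-head attention with edge features
l1 = TransformerConv((16, 8) => 16; heads = 2)
y1 = l1(g, x, e)                  # 2 heads × 16 channels, concatenated (assuming concat defaults to true)

# Kool et al. style: self-loops instead of the gated root term, plus
# skip connection, batch normalization, and a feed-forward block
l2 = TransformerConv(16 => 16; heads = 2, concat = false,
                     add_self_loops = true, root_weight = false,
                     skip_connection = true, batch_norm = true,
                     ff_channels = 64)
y2 = l2(g, x)                     # 16 × 10
```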

Carlo: edited with the correct link to the Attention, Learn to solve... paper

codecov[bot] commented 1 year ago

Codecov Report

Merging #249 (7aba1f2) into master (2020e58) will increase coverage by 0.55%. The diff coverage is 95.52%.

@@            Coverage Diff             @@
##           master     #249      +/-   ##
==========================================
+ Coverage   80.68%   81.23%   +0.55%     
==========================================
  Files          17       17              
  Lines        1734     1801      +67     
==========================================
+ Hits         1399     1463      +64     
- Misses        335      338       +3     
Impacted Files       Coverage Δ
src/layers/conv.jl   79.28% <95.52%> (+2.34%) ↑


graidl commented 1 year ago

Note that the Julia 1.6 CI run shows two failed tests, but they concern GATConv rather than TransformerConv, and this PR doesn't touch GATConv. I tried to reproduce these failures on my side, but they only seem to occur with small probability. Maybe the tests are not deterministic; would it be a good idea to set a constant random seed in runtests.jl and/or the individual test files?
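For illustration, a minimal sketch of the seeding idea, assuming the standard test entry point test/runtests.jl and an arbitrary seed value:

```julia
# At the top of test/runtests.jl (sketch): fix the global RNG so that the
# stochastic GATConv tests give the same result on every CI run.
using Random
Random.seed!(17)   # arbitrary constant seed
```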

CarloLucibello commented 1 year ago

> Maybe the tests are not deterministic

Yes, this is the case. Let's see if the recent changes to the numerical stability of GATConv have made them more robust.

CarloLucibello commented 1 year ago

I'm worried that the two papers use two different normalizations; maybe we should treat layers just as basic building blocks and try not to conflate too much.

So I suggest the following:

CarloLucibello commented 1 year ago

@graidl what do you think? I'm not sure what the right path is here.

graidl commented 1 year ago

Thanks for reviewing the proposed extension so quickly; your feedback is much appreciated, and you indeed raise important questions! Let me start with a little background so that you understand my motivation for some of the decisions made so far.

Originally, I started with the model from Kool et al. https://arxiv.org/abs/1803.08475. As in the famous paper on the classical Transformer architecture by Vaswani et al. https://arxiv.org/abs/1706.03762 (not related to graphs), Kool et al. apply batch normalization and skip connections in conjunction with the message passing as the central building block of the whole model. This building block appears in many works relating to transformers in general; I would even say that it is nowadays the standard ingredient added to the propagation.

Since our work also required edge features, I stumbled at some point upon the PyTorch Geometric implementation of TransformerConv, which relies on the newer paper by Shi et al. https://arxiv.org/pdf/2009.03509.pdf. The principles are very similar, yet there are several smaller differences here and there which move the approach a little away from the classical Transformer, for example in the normalization and the gating mechanism. Note that Kool et al. use self-loops and self-attention instead of the gating, which imho is clearly more in line with the classical Transformer architecture. Unfortunately, Shi et al., although newer, do not provide an in-depth experimental comparison to earlier Transformer-based GNNs, except GAT, and personally I am therefore not that convinced of the several smaller deviations they apply. A more serious comparison would be interesting, and we intend to do this in one of our works.

Ultimately, I ended up trying to combine aspects of the different approaches and borrowed a few ideas from PyTorch Geometric's TransformerConv implementation.
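To make that building block concrete, here is a rough Flux sketch of the skip-connection + batch-normalization pattern used by Kool et al. (my own illustration, with a plain Dense layer standing in for the multi-head attention step):

```julia
using Flux

dim  = 64
attn = Dense(dim => dim)                              # stand-in for multi-head attention
ffn  = Chain(Dense(dim => 4dim, relu), Dense(4dim => dim))

# Each sublayer is wrapped as h ← BN(h + sublayer(h)), as in the classical
# Transformer, but with batch normalization instead of layer normalization.
block = Chain(
    SkipConnection(attn, +), BatchNorm(dim),
    SkipConnection(ffn, +),  BatchNorm(dim),
)

h = rand(Float32, dim, 10)                            # 10 node embeddings of size dim
h_out = block(h)                                      # same shape: dim × 10
```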

Now to your specific suggestions:

Let me know how you prefer to proceed, and I will prepare a revision that takes all your remarks into account. Thanks!

CarloLucibello commented 1 year ago

I'm merging this PR as it is. Besides its utility per se, the layer is a very good template for implementing many transformer-like blocks. Thanks for the effort, an impeccable PR.