GATConv inconsistent with original implementation

jackd commented 3 years ago

Compared to the original repository, the GATConv implementation has some minor differences. In particular, for the original implementation:

a different dropout is applied to each head before dense transformations;
each head has dropout applied in 3 places - once before key/query computations, once on the attention values and once before sparse/dense multiplication; and
activation/bias addition is applied before concatenation/averaging.

I should point out that I think the original implementation is quite inefficient and the one contained here-in makes more sense from both a computational efficiency and code cleanliness perspective. Experiments show a slight drop in performance, but that can be offset somewhat by increasing the dropout rate (which makes sense, because the original implementation contains more dropout operations). That obviously opens a pandoras box though - why not change other hyperparameters to get better results?

Is the goal of this project to reproduce existing work, or to improve upon that work? Should this repo perhaps contain two versions - one as close as possible to the original, and another with more freedom to impose opinions on what is "right" / "cleaner"?

FWIW I have the original version implemented and am happy to contribute it.

danielegrattarola commented 3 years ago

Hi!

It's a delicate tradeoff between user-friendliness and reproducing the exact implementation from the literature.

In GAT's case, the dropout for features feels like something that should not be part of the layer, since it is more natural to have it as a separate layer. Dropout on the adjacency matrix could also be applied externally, but is specifically described in the paper as being a crucial aspect to the model's success, so I included it. Same as LeakyReLU and having parallel heads (although these two could not be applied from outside the layer). It's debatable and I am open to hearing feedback.

The concatenation-before-activation thing is because I wanted the activation to be the last thing applied to the model, to be consistent with the rest of Spektral/Keras.

Is the goal of this project to reproduce existing work, or to improve upon that work?

The project's priority is to be usable in a wide range of tasks, not necessarily to reproduce the configuration of the original papers (which are often limited in scope/overfitted to specific datasets). The example scripts should be as close as possible to the original experiments but I wouldn't expect them to reproduce the results down to the decimal points.

GAT is a particularly tricky one, because a lot of architectural choices were presented as part of the layer (whereas I think that the idea of attention should be treated separately from dropout, activations, etc). Usually there is less ambiguity.

I should point out that I think the original implementation is quite inefficient and the one contained here-in makes more sense from both a computational efficiency and code cleanliness perspective

I agree, which is also why I would keep the current GAT layer as-is.

That obviously opens a pandoras box though - why not change other hyperparameters to get better results?

Correct. I think that PR #147 is OK to merge, since it modifies the examples to be in line with the paper's config, but I wouldn't change too much just to match results. We're more than enough close.

Should this repo perhaps contain two versions - one as close as possible to the original, and another with more freedom to impose opinions on what is "right" / "cleaner"?

I think that would cause too much confusion. The layer should allow some degrees of freedom in choosing hparams, and the examples should show how to use those degrees of freedom to replicate the original papers as close as possible.

Again, this is just my personal opinions and I am open to have my mind changed.

Cheers, Daniele

jackd commented 3 years ago

I don't have any fundamental disagreements there. I suppose it might be worth documenting known differences either in the layer docstrings or wherever benchmarks results are documented?

danielegrattarola commented 3 years ago

Yes, that's a good idea. I'd go with the latter, personally. I'll leave this issue open until the benchmarks are done and these differences addressed.

danielegrattarola / spektral

GATConv inconsistent with original implementation #146