PetarV- / GAT

Graph Attention Networks (https://arxiv.org/abs/1710.10903)
https://petar-v.com/GAT/

how to understand “dropping all structural information” #27

Closed guoyejun closed 5 years ago

guoyejun commented 5 years ago

Hi,

I'm trying to study GAT, and it's awesome work!

I'm wondering how to understand "without ... depending on knowing the graph structure upfront", and "dropping all structural information".

I also see "we only compute e_ij for nodes j ∈ N_i, where N_i is some neighborhood of node i in the graph" in the paper, and I also see the function adj_to_bias() in the code, which requires knowing the adjacency matrix (the edges).

So my understanding is that we do need to know the graph structure (edge information) at the beginning. Thanks.

PetarV- commented 5 years ago

Hi Yejun,

Thank you for the issue, and your kind interest in GAT!

The phrase "without depending on the graph structure upfront" refers to the training/testing routine. Namely, GAT is an inductive method---the mechanism it learns is in principle not conditioned on the graph it has been trained on. This means that, at test time, you can apply GAT to any structure you'd like (including ones unseen at training time). This is in stark contrast to many methods that were published before (which were transductive, and wouldn't in theory work outside of the graph they were trained on).
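As a side note on why this works: every learned parameter of a GAT layer is sized by the feature dimensions only, never by the number of nodes. A minimal numpy sketch of the shapes involved (purely illustrative, not this repository's code):

```python
import numpy as np

# Illustrative shapes: the learned parameters depend only on feature
# dimensions, never on the number of nodes in the graph.
F_in, F_out = 8, 4
W = np.random.randn(F_in, F_out)   # shared linear transform
a = np.random.randn(2 * F_out)     # shared attention vector

# So the same trained (W, a) applies to a 5-node graph...
H_small = np.random.randn(5, F_in) @ W     # (5, F_out)
# ...or to a 5000-node graph never seen during training:
H_large = np.random.randn(5000, F_in) @ W  # (5000, F_out)
```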

Regarding the second phrase, I believe you've misread the paper a little bit. From what I recall, the phrase appears here:

"In its most general formulation, the model allows every node to attend on every other node, dropping all structural information"

This is the formulation before masked attention is introduced (i.e. we just do all-pairs self-attention as in the Transformer paper). Indeed, in this version the graph is not used at all. Afterwards we introduce the neighbourhoods, and the graph structure is injected.
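To make the two formulations concrete, here is a minimal numpy sketch (illustrative only, not this repository's actual code): unmasked attention normalises over all node pairs, while masked attention restricts the softmax to each node's neighbourhood.

```python
import numpy as np

def attention_coeffs(H, a, adj=None):
    """Single-head attention over rows of H (N, F). If adj is given,
    attention is masked so each node only attends over its neighbours."""
    N = H.shape[0]
    # e[i, j] = LeakyReLU(a^T [h_i || h_j]), computed for all pairs
    e = np.array([[np.concatenate([H[i], H[j]]) @ a for j in range(N)]
                  for i in range(N)])
    e = np.where(e > 0, e, 0.2 * e)            # LeakyReLU, slope 0.2
    if adj is not None:
        e = np.where(adj > 0, e, -1e9)         # non-edges vanish after softmax
    e = e - e.max(axis=1, keepdims=True)       # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

H = np.random.randn(4, 6)
a = np.random.randn(12)
adj = np.array([[1, 1, 0, 0],                  # adjacency with self-loops
                [1, 1, 1, 0],
                [0, 1, 1, 1],
                [0, 0, 1, 1]])
alpha_all    = attention_coeffs(H, a)          # "drops all structural information"
alpha_masked = attention_coeffs(H, a, adj)     # graph structure injected
```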

So, to confirm, the GAT model does not drop all structural information. It uses the local adjacency information of every node to determine which other nodes to attend over. That being said, it only needs the local information (i.e. a node does not need to know anything about a node that is outside of its neighbourhood).
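This is also what the adj_to_bias() utility you found is for: it converts adjacency information into an additive mask, so that nodes outside a neighbourhood receive a large negative bias before the softmax and hence effectively zero attention. A simplified sketch of that idea (the function in the repo is more general, e.g. it handles batches of graphs):

```python
import numpy as np

def adj_to_bias_sketch(adj):
    """Simplified take on converting an (N, N) adjacency matrix into an
    additive attention bias: neighbours (and self-loops) get bias 0,
    everything else gets -1e9, which becomes ~0 after the softmax."""
    mask = ((adj + np.eye(adj.shape[0])) > 0).astype(float)
    return -1e9 * (1.0 - mask)

adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
bias = adj_to_bias_sketch(adj)
# attention logits + bias, then softmax => attention restricted to neighbourhoods
```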

Hope that helps! Let me know if more clarification is needed.

Thanks, Petar

guoyejun commented 5 years ago

Thanks, Petar!

btw, how does GAT handle an edge with an arrow (i.e. a directed edge)? My understanding is that it depends on the definition of 'neighborhood': alpha in one direction is learned/calculated, while alpha in the other direction is just zero.

PetarV- commented 5 years ago

Hi Yejun,

The general answer is "it's up to you". :)

In the simplest case, as you suggested, the attention is simply not computed over one direction. Other authors like to include a notion of two "edge types" (inbound/outbound) and learn a separate set of attention heads for each edge type. I'd say -- it really depends on the problem you're trying to solve (and how expressive the edges actually are semantically), but ultimately the framework is quite flexible with respect to how you choose to approach this.
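For instance, here's a rough sketch of the second option (illustrative names, not code from this repo): compute one set of attention coefficients masked by A (outbound edges) and another masked by A.T (inbound edges), then combine the two.

```python
import numpy as np

def masked_softmax(e, adj):
    """Row-wise softmax over e, restricted to entries where adj > 0."""
    e = np.where(adj > 0, e, -1e9)
    e = e - e.max(axis=1, keepdims=True)
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

def directed_attention(H, e, A):
    """One 'attention head' per edge type: outbound (A) and inbound (A.T)."""
    I = np.eye(A.shape[0])                 # keep self-loops in both directions
    alpha_out = masked_softmax(e, A + I)   # attend along i -> j
    alpha_in  = masked_softmax(e, A.T + I) # attend along j -> i
    return np.concatenate([alpha_out @ H, alpha_in @ H], axis=1)

N, F = 4, 6
H = np.random.randn(N, F)                  # transformed node features
e = np.random.randn(N, N)                  # pre-softmax attention logits
A = np.array([[0, 1, 0, 0],                # directed adjacency: i -> j
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
out = directed_attention(H, e, A)          # (N, 2F)
```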

Thanks, Petar