Closed ufimtsev closed 3 years ago
Hi!
Thanks for your interest! Yes, you are exactly right on both counts. The hyperparameter div is meant to mimic what people do with non-equivariant transformers, and it also makes the model more efficient.
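To make the analogy concrete, here is a minimal sketch (assumed names and sizes, not the repo's actual code) of how a `div` factor shrinks the key/query/value width, the same way `d_model // num_heads` shrinks the per-head width in a vanilla transformer:

```python
import numpy as np

# Hypothetical sketch: `div` reduces the width of the attention
# projections relative to the feature channels, saving memory/compute.
rng = np.random.default_rng(0)
channels, div, n_heads, n_nodes = 32, 4, 2, 10
inner = channels // div          # reduced key/query/value width (8)
head_dim = inner // n_heads      # width per attention head (4)

x = rng.standard_normal((n_nodes, channels))
Wq = rng.standard_normal((channels, inner))
Wk = rng.standard_normal((channels, inner))
Wv = rng.standard_normal((channels, inner))

# the attention math runs entirely in the smaller `inner` space
q = (x @ Wq).reshape(n_nodes, n_heads, head_dim)
k = (x @ Wk).reshape(n_nodes, n_heads, head_dim)
v = (x @ Wv).reshape(n_nodes, n_heads, head_dim)
print(q.shape)  # (10, 2, 4)
```

In a vanilla transformer the analogous reduction happens implicitly when `d_model` is split across heads; the `div` hyperparameter makes that reduction explicit and tunable.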
Hi Fabian,
Thanks for your answer. May I ask another question here, so we don't need to open a new issue? Namely, are a node's own key and value used when computing the weighted sum of values? It looks like you take each neighbour's key and dot it with the node's query, which is how attention works, I suppose; but it seems the node's own key is never dotted with its own query, and the node's own value is not added to the sum. Is this by design, or am I missing something? Sorry if this is an obvious question; I have very little experience with transformers and none at all with dgl. Thanks again for the great code!
You are correct in your observation. The reason is the following: the keys are computed using spherical harmonics evaluated at the relative position between two nodes. If the key node and the query node coincide, the relative position is the zero vector, whose direction (and hence the value of the spherical harmonics) is undefined, which would cause problems. Hence, we don't 'attend' to the query node itself.
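A minimal sketch of this design choice (assumed names, plain numpy rather than the repo's dgl code): attention is taken over a node's neighbours only, never over the node itself, precisely because the edge features depend on the relative position, which degenerates to the zero vector for a self-edge:

```python
import numpy as np

# Toy neighbourhood attention that excludes the query node itself.
rng = np.random.default_rng(1)
pos = rng.standard_normal((5, 3))      # node coordinates
feats = rng.standard_normal((5, 8))    # node features
Wq = rng.standard_normal((8, 8))
Wk = rng.standard_normal((8, 8))
Wv = rng.standard_normal((8, 8))

i = 0                                   # query node
neighbours = [j for j in range(5) if j != i]  # self edge excluded

# in the real model, keys/values are additionally modulated by spherical
# harmonics of the relative direction; here we just check it is well defined
rel = pos[neighbours] - pos[i]
assert np.all(np.linalg.norm(rel, axis=1) > 0)  # nonzero only because i != j

q = feats[i] @ Wq
k = feats[neighbours] @ Wk
v = feats[neighbours] @ Wv

logits = k @ q / np.sqrt(8)
alpha = np.exp(logits - logits.max())
alpha /= alpha.sum()                    # softmax over neighbours only
out = alpha @ v                         # node i's own value never enters the sum
print(out.shape)  # (8,)
```

So the output is a convex combination of the neighbours' values only; the query node's own value would require a spherical harmonic at relative position zero, which is not well defined.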
Hi Fabian,
Thanks for the great code. I wanted to ask why you scale the channels by div in the attention layer. Is it akin to dividing the channels by the number of heads in a vanilla transformer? Or were there other considerations involved, like memory or speed? Thanks!