FabianFuchsML / se3-transformer-public

Code for the SE(3)-Transformers paper: https://arxiv.org/abs/2006.10503

What is the logic behind using the 'div' parameter? #6

Closed · ufimtsev closed this issue 3 years ago

ufimtsev commented 3 years ago

Hi Fabian,

Thanks for the great code. I wanted to ask why you scale the channels by `div` in the attention layer. Is it akin to scaling the channels in a vanilla transformer by the number of heads? Or were there other considerations like memory/speed involved? Thanks!

FabianFuchsML commented 3 years ago

Hi!

Thanks for your interest! Yes, you are exactly right on both counts. The hyperparameter `div` is meant to mimic what people do with non-equivariant transformers, and it also makes the layer more memory- and compute-efficient.
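
For readers unfamiliar with the idea, here is a minimal, hypothetical sketch (not the repository's actual API) of how a `div` hyperparameter can shrink the key/query/value width inside an attention layer, analogous to the per-head dimension `d_model // n_heads` in a vanilla transformer:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: 'div' reduces the internal key/query/value width,
# similar to how multi-head attention works with d_model // n_heads per head.
class ToyAttention(nn.Module):
    def __init__(self, channels: int, n_heads: int = 4, div: int = 4):
        super().__init__()
        inner = channels // div              # reduced attention width
        assert inner % n_heads == 0
        self.n_heads = n_heads
        self.to_q = nn.Linear(channels, inner, bias=False)
        self.to_k = nn.Linear(channels, inner, bias=False)
        self.to_v = nn.Linear(channels, inner, bias=False)
        self.to_out = nn.Linear(inner, channels)   # project back to full width

    def forward(self, x):                    # x: (batch, nodes, channels)
        b, n, _ = x.shape
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # split the reduced width across heads
        q, k, v = (t.reshape(b, n, self.n_heads, -1).transpose(1, 2)
                   for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)
```

Because the keys, queries, and values live in a smaller space, both the projection matrices and the attention computation get cheaper, which is where the memory/speed benefit comes from.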

ufimtsev commented 3 years ago

Hi Fabian,

Thanks for your answer. May I ask another question here so we don't need to create a new issue? Namely, are a node's own key and value used to compute the weighted sum of values? It looks like you take each neighbor's key and dot it with the node's query, which is how attention works, I suppose, but the node's own key never seems to be dotted with its own query, and the node's own value is not added to the sum. Is it designed to work that way, or am I missing something? Sorry if this is an obvious question; I have very little experience with transformers and none whatsoever with DGL. Thanks again for the great code!

FabianFuchsML commented 3 years ago

You are correct in your observation. The reason is the following: the way we compute keys and queries uses the spherical harmonics evaluated at the relative position. If node_key == node_query, the relative position is the zero vector, whose direction (and hence its spherical-harmonic embedding) is not well defined. Hence, we don't 'attend' to the query node.
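
As an illustration of the consequence (a hypothetical sketch, not the repository's DGL-based implementation): attention weights are computed only over a node's neighbors, and the query node itself is masked out, so its own value never enters the weighted sum.

```python
import torch

# Hypothetical sketch of neighbor-only attention: node i attends to its
# neighbors j but not to itself, because the edge features (spherical
# harmonics of the relative position x_j - x_i) are undefined when j == i.
def neighbor_attention(q, k, v, adjacency):
    """
    q, k, v:    (nodes, dim) per-node query/key/value features
    adjacency:  (nodes, nodes) boolean; adjacency[i, j] is True if j is a
                neighbor of i; the diagonal (self-loops) is assumed False
    """
    logits = q @ k.T / q.shape[-1] ** 0.5            # (nodes, nodes)
    logits = logits.masked_fill(~adjacency, float("-inf"))
    attn = torch.softmax(logits, dim=-1)             # weights over neighbors only
    return attn @ v                                  # self value excluded from the sum

# Tiny usage example on a 3-node chain graph (no self-loops)
n, d = 3, 8
q, k, v = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)
adj = torch.tensor([[False, True,  False],
                    [True,  False, True ],
                    [False, True,  False]])
out = neighbor_attention(q, k, v, adj)
```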