cvignac / DiGress

code for the paper "DiGress: Discrete Denoising diffusion for graph generation"
MIT License

Question about the architecture (graphTransformer) #87

Open Forbu opened 5 months ago

Forbu commented 5 months ago

I was looking at your implementation of attention here: https://github.com/cvignac/DiGress/blob/main/src/models/transformer_model.py#L158

I have some questions about the code:

Q = Q.unsqueeze(2)  # (bs, n, 1, n_head, df)
K = K.unsqueeze(1)  # (bs, 1, n, n_head, df)

# Compute unnormalized attentions. Y is (bs, n, n, n_head, df)
Y = Q * K

Here I have a question, because in the classic attention mechanism Y has dimension (bs, n, n, n_head), i.e. the scores are not feature-specific. I don't know if this is what the author intended (this is not a proper outer product, it is an element-wise multiplication).
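To make the difference concrete, here is a minimal shape-only sketch (made-up tensor sizes, not the repo's actual code):

import torch

bs, n, n_head, df = 2, 5, 4, 8          # illustrative sizes only
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# DiGress-style: element-wise product, broadcast over the query/key axes
Y = Q.unsqueeze(2) * K.unsqueeze(1)     # (bs, n, 1, n_head, df) * (bs, 1, n, n_head, df)
print(Y.shape)                          # torch.Size([2, 5, 5, 4, 8]) -> one score per feature

# Classic scaled dot-product: summing over df gives a scalar score per head
scores = torch.einsum('bihd,bjhd->bijh', Q, K) / df ** 0.5
print(scores.shape)                     # torch.Size([2, 5, 5, 4])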

Also, a few lines later we have:

attn = masked_softmax(Y, softmax_mask, dim=2)  # bs, n, n, n_head
print("attn.shape : ", attn.shape) # i add this

The attention shape I obtain is (bs, n, n, n_head, df) (contrary to the comment). So the code does not really implement "real" graph-transformer attention like other implementations do, e.g.: https://docs.dgl.ai/_modules/dgl/nn/pytorch/gt/egt.html#EGTLayer

But since your code gives me better results than the one above (which uses a proper attention mechanism), I wonder whether this is something the authors did intentionally.
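As a side note, the shape observation can be reproduced with a tiny stand-in for masked_softmax (this is not the repo's implementation, just a mask-then-softmax sketch with an all-ones dummy mask):

import torch

def masked_softmax(x, mask, dim):
    # stand-in: set masked-out positions to -inf, then softmax
    return torch.softmax(x.masked_fill(mask == 0, float('-inf')), dim=dim)

bs, n, n_head, df = 2, 5, 4, 8                # illustrative sizes only
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)
Y = Q.unsqueeze(2) * K.unsqueeze(1)           # (bs, n, n, n_head, df)
softmax_mask = torch.ones(bs, 1, n, 1, 1)     # dummy key mask, broadcastable over Y

attn = masked_softmax(Y, softmax_mask, dim=2)
print(attn.shape)   # torch.Size([2, 5, 5, 4, 8]): the feature axis df is still there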

cvignac commented 5 months ago

Copying the answer from https://github.com/cvignac/DiGress/issues/47

Your observation is correct. It is not exactly the standard attention mechanism. I have not thoroughly compared the two, but the current code was written on purpose. The reason is that we have to manipulate features of size (bs, n, n, de) anyway, so using vector attention scores instead of scalar ones does not create a strong memory bottleneck. It would be interesting to investigate this further, though.

Clement
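For a rough sense of the memory argument, a back-of-the-envelope comparison with made-up sizes (none of these values come from the paper or configs):

bs, n, n_head, df, de = 32, 50, 8, 32, 64   # illustrative sizes only

vector_scores = bs * n * n * n_head * df    # (bs, n, n, n_head, df) attention tensor
edge_features = bs * n * n * de             # (bs, n, n, de) edge features kept around anyway

print(vector_scores / edge_features)        # 4.0 here: same O(n^2) scaling, constant-factor difference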


Forbu commented 5 months ago

I am running some experiments on my own graph dataset. Your implementation seems to be more performant than the standard graph transformer (at least the one I tried from the DGL library): yours clearly manages to generate more plausible edges. I am doing more experiments to confirm this (I currently only have "visual" clues and noisy loss curves to back this claim).

Your implementation is equivalent to a classic graph transformer with as many heads as the original feature dimension, so you end up with heads of dimension one (i.e. if df = 1 you would obtain the same results).
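If that reading is right, it can be checked numerically with a small sketch (random tensors, no masking, scaling or edge features, so only an approximation of the real layer):

import torch

bs, n, n_head, df = 2, 5, 4, 8                     # illustrative sizes only
Q = torch.randn(bs, n, n_head, df)
K = torch.randn(bs, n, n_head, df)

# DiGress-style: element-wise scores, softmax over the key index, per feature
attn_digress = (Q.unsqueeze(2) * K.unsqueeze(1)).softmax(dim=2)   # (bs, n, n, n_head, df)

# Classic dot-product attention with n_head * df heads of size 1
Qh = Q.reshape(bs, n, n_head * df)
Kh = K.reshape(bs, n, n_head * df)
scores = torch.einsum('bih,bjh->bijh', Qh, Kh)                    # scalar score per 1-dim head
attn_classic = scores.softmax(dim=2).reshape(bs, n, n, n_head, df)

print(torch.allclose(attn_digress, attn_classic))  # True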