Question About Softmax in Pytorch Code

https://github.com/CAMMA-public/rendezvous/blob/16490a3b72fa3d571c00d5dd70052513f789a48b/pytorch/network.py#L319-L350

Hello, Thank You for Your nice work!

I've recently been working with the pytorch version of the code and have noticed a small issue I found a bit odd.

In the MHMA module of "network.py" file, the code applies self and cross attention in the "scale_dot_product" function (line 329).

In it, attention weights are computed (which should be of dimension [B, Head, D, D] = [32, 4, 128, 128]). And then softmax is applied on them. However, softmax (defined in line 326) is applied on dim=1, i.e. on the Head dimension.

I was wondering if this is a typo and instead it should be nn.Softmax(dim=-1)?

A similar issue arrises in CAGAM module earlier.

The inputs to "scale_dot_product" function are of the following dimensions:

1) key [B,Head,D,1], where B=batch_size=32 (by default) | Head=num_of_attention_heads=4 | D=internal_channel_dimension=128 2) query [B,Head,D,1] 3) value [B,Head,D, H*W], where H,W = image_height, image_width (of Class Activation Maps from the Encoder) = 32, 56 if hr_output=True or 8, 14 otherwise

CAMMA-public / rendezvous

Question About Softmax in Pytorch Code #10