eileenforwhat opened this issue 1 year ago
Hi @eileenforwhat, it's been a while since I've worked on this, so I had to think it through myself. The snippet is taken from lucidrains' code: https://github.com/lucidrains/flamingo-pytorch/blob/10913abbc8b2ceabb2320560d7d9b85fcb85eee3/flamingo_pytorch/flamingo_pytorch.py#L170, where he does the same thing.
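For context, the pattern you are asking about looks roughly like this. This is only a paraphrased sketch (the variable names sim and mask follow the ones you quoted, not the exact lines of MaskedCrossAttention):

import torch

def masked_softmax(sim: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # paraphrased sketch of the pattern, not the exact repo code
    # push positions we want to ignore down to the most negative finite value
    sim = sim.masked_fill(~mask, -torch.finfo(sim.dtype).max)
    # shift so the per-row maximum becomes 0; the softmax result is unchanged,
    # but exp() can no longer overflow
    sim = sim - sim.amax(dim=-1, keepdim=True).detach()
    return sim.softmax(dim=-1)

scores = torch.randn(2, 4)                       # e.g. (batch, keys) attention scores
keep = torch.tensor([[True, True, False, False],
                     [True, True, True, True]])
print(masked_softmax(scores, keep))              # masked columns get zero weight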
Consider this toy example:
import torch

mask = torch.tensor([0, 0, 1, 1], dtype=torch.bool)    # True = keep, False = mask out
x = torch.tensor([1, 2, 3, 4], dtype=torch.float64)
print("mask:", mask)
print("inverted mask:", ~mask)
x = x.masked_fill(~mask, -torch.finfo(x.dtype).max)    # masked positions -> most negative finite value
print("x:", x)
x = x - x.amax(dim=-1, keepdim=True).detach()          # shift so the maximum becomes 0
print("x:", x)
alphas = x.softmax(dim=-1)
print("alphas: ", alphas)
which gives this result:
$ python test.py
mask: tensor([False, False, True, True])
inverted mask: tensor([ True, True, False, False])
x: tensor([-1.7977e+308, -1.7977e+308, 3.0000e+00, 4.0000e+00],
dtype=torch.float64)
x: tensor([-1.7977e+308, -1.7977e+308, -1.0000e+00, 0.0000e+00],
dtype=torch.float64)
alphas: tensor([0.0000, 0.0000, 0.2689, 0.7311], dtype=torch.float64)
Here, subtracting the maximum value (4) is not "normalizing"; it only shifts the largest value to zero. Softmax is shift-invariant, softmax(x - c) == softmax(x) for any constant c (the exp(c) factor cancels between numerator and denominator), so this does not change the result at all. My assumption is that it is done for numerical stability: after the shift every entry is <= 0, so exp() cannot overflow.
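To make the stability point concrete, here is a tiny (hypothetical, float32) example where the naive softmax overflows but the max-subtracted version does not:

import torch

logits = torch.tensor([1000., 1001., 1002.])

# naive softmax: exp(1000.) overflows to inf in float32, so we end up with nan
naive = logits.exp() / logits.exp().sum()
print(naive)    # tensor([nan, nan, nan])

# subtracting the max first keeps every exponent <= 0, so nothing overflows
shifted = logits - logits.max()
stable = shifted.exp() / shifted.exp().sum()
print(stable)   # tensor([0.0900, 0.2447, 0.6652])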
Note that setting the values we want to mask to -infinity (or, as here, to the most negative finite value of the dtype) makes them come out as 0 after the softmax operation, which is exactly what we want to achieve.
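For comparison, using an actual -inf gives the same attention weights on the unmasked positions:

import torch

x = torch.tensor([float("-inf"), float("-inf"), 3.0, 4.0])
print(x.softmax(dim=-1))   # tensor([0.0000, 0.0000, 0.2689, 0.7311])

My reading (not something stated in the repo) is that the finite value is the safer choice: if an entire row happened to be masked, subtracting the amax would leave zeros everywhere and the softmax would return uniform weights, whereas with -inf the whole row would become nan.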
Hope this helps!
I see. This makes sense -- Thank you!
sure, feel free to ask if you have any more doubts :)
Hi, I'm unsure about this piece of code in MaskedCrossAttention inside gated_cross_attention.py. It seems you are setting the positions you want to mask out to -torch.finfo(sim.dtype).max (a large negative number), but then finding the largest value sim.amax to normalize by? I would think it should be:

Any clarification on this logic is appreciated. Thanks!