jdgh000 opened this issue 1 week ago
Hi @jdgh000,
Your code seems to be perfectly right. I am guessing the skewed values you're concerned about are just the effect of the softmax function. Softmax does skew values a lot when you increase the scale of the inputs, even if the proportion between the two input values is the same.
If we take two values, say, 0.01 and 0.1, the second is 10x larger than the first, but softmax will return fairly similar results for both:
import torch
import torch.nn.functional as F

print(F.softmax(torch.as_tensor([.01, .1]), dim=-1))
tensor([0.4775, 0.5225])
However, if we multiply these values by 10, their proportion remains unchanged, but their overall level is 10x higher, thus affecting how softmax transforms them:
print(F.softmax(torch.as_tensor([.01, .1])*10, dim=-1))
tensor([0.2891, 0.7109])
If we try 100x larger than the initial values, we start seeing the kind of extremely skewed values you mentioned:
print(F.softmax(torch.as_tensor([.01, .1])*100, dim=-1))
tensor([1.2339e-04, 9.9988e-01])
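Incidentally, for two inputs the effect depends only on their *difference*: softmax([a, b])[1] = 1/(1 + e^(a-b)), so multiplying both inputs by 10 multiplies the gap by 10 as well. A minimal plain-Python sketch of the same three cases (no torch needed, just to show it is a property of the function itself):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability; softmax is shift-invariant.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Same 10x proportion between the two inputs, three different overall scales.
for scale in (1, 10, 100):
    probs = softmax([0.01 * scale, 0.1 * scale])
    print(scale, [round(p, 4) for p in probs])
```

The rounded outputs match the torch results above (0.5225, 0.7109, and ~0.9999 for the larger value), even though the 10x proportion between the two inputs never changes.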
So, it all boils down to the fact that the softmax function exponentiates the inputs in order to transform them into probabilities adding up to one. Does this answer your question?
Best, Daniel
Yes, I think so. It may be interesting to pursue this path, but I'd move on. I just wanted to see if my understanding is correct through some sample code. By the way, this appears to be a simple explanation of softmax: https://victorzhou.com/blog/softmax/
So I am into the attention network, one of the toughest topics to understand, and the book so far explains it great. However, the scaled dot product example shows that scaling the product of ks, q by 100 skews the result. I extended this example by using the actual key and query from the earlier example (on p. 262, in order to compute dim, which happens to be just 2) and compared the non-scaled (p. 275) and scaled versions side by side. But even in the scaled one, there still seems to be a big variance between prod and 100*prod. I was hoping to see similar results whether or not prod is multiplied by 100, so I must be doing something wrong...:
Result:
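For reference, here is a minimal sketch of scaled dot-product attention (assuming the usual division by sqrt(d_k), as in the book; the query/key values below are made up, not the ones from p. 262). The sqrt(d_k) scaling only compensates for dot products growing with the dimension d_k; it does not make softmax invariant to multiplying the products by an arbitrary factor like 100, which is why the scaled version still skews in that experiment:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a plain Python list.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_attention_weights(query, keys):
    # Dot product of the query with each key, divided by sqrt(d_k).
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    return softmax(scores)

# Hypothetical 2-dimensional query and keys (illustrative values only).
query = [0.5, 0.2]
keys = [[0.3, 0.9], [0.8, 0.1]]
print(scaled_attention_weights(query, keys))
```

With small scores the weights stay close to uniform; if you multiply the dot products by 100 before the softmax, the sqrt(2) divisor is far too small to undo that, and the weights collapse toward one key, which is consistent with what you observed.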