TeKett closed this issue 4 months ago
Not sure how it actually works, but this is an issue with kohya's Python script. You should open an issue directly on his repository, sd-scripts.
Beam search is a branching method of searching where, at every layer, you have all the possibilities (in this case tokens). Instead of choosing just the one option with the highest score, you choose the k best possibilities and follow up on each of them. The k value is the beam width, so with a beam width of two you always keep the two highest-scoring options at each step. Say you have a sentence that starts with "This" and want to figure out the most likely next word: you check against all the tokens and get "girl", but it's possible "girl" only scores highest due to model bias and isn't actually correct. With a beam width of 2 we can keep both "girl" and "potato", and "potato" is the "correct" one. And so on and so forth until you get "This potato smells odd" or whatever.
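The idea can be sketched in a few lines of Python. The vocabulary and log-probabilities below are hand-picked purely to illustrate the "girl" vs "potato" example; they are not from any real model:

```python
import math

# Toy vocabulary and score table, invented for illustration only.
VOCAB = ["girl", "potato", "smells", "odd", "<eos>"]

SCORES = {
    (): {"girl": -0.2, "potato": -0.4},           # "girl" wins step 1 (model bias)
    ("girl",): {"<eos>": -3.0},                   # ...but the "girl" path dead-ends
    ("potato",): {"smells": -0.1},
    ("potato", "smells"): {"odd": -0.1},
    ("potato", "smells", "odd"): {"<eos>": -0.1},
}

def score(prefix, token):
    return SCORES.get(tuple(prefix), {}).get(token, -math.inf)

def beam_search(k, steps=4):
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":        # finished hypotheses carry over
                candidates.append((seq, logp))
                continue
            for tok in VOCAB:
                s = score(seq, tok)
                if s > -math.inf:
                    candidates.append((seq + [tok], logp + s))
        # Keep only the k highest-scoring hypotheses (k = beam width).
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams
```

With `beam_search(1)` the search greedily commits to "girl" and ends up with a poor total score; with `beam_search(2)` the "potato" hypothesis survives the first step and wins overall.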
Number of beams is how many different theories we have about what the sentence might be. With 1 beam we have:
With 2 beams we can have
But we evaluate both beams at the same time, because it would be pointless to do two evaluations of the same options; you'd just get the same path twice.
So we are trying to give the system the matrix of size two twice: ([a,b],[c,d]). But the system expects us to give it something like [a,b,c,d], because it evaluates the whole thing at once instead of each beam separately.
The reason it works with a number of beams of 1 is that we give a matrix of size 2 and it expects size 2. But if you increase the number of beams, it expects a bigger matrix; we only have two smaller ones to offer, while it expects a single bigger one. So we need the two size-2 matrices to be one size-4 matrix.
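The shape logic can be sketched with NumPy (the actual tensors are PyTorch, but the axes behave the same way; the values here are placeholders):

```python
import numpy as np

# Two per-beam tensors, each with dim0 of size 2.
beam1 = np.array([[1.0, 2.0], [3.0, 4.0]])     # shape (2, 2)
beam2 = np.array([[5.0, 6.0], [7.0, 8.0]])     # shape (2, 2)

# Passing them as a nested pair -- "([a,b],[c,d])" -- gives shape (2, 2, 2):
nested = np.stack([beam1, beam2])

# What the consumer expects is everything flattened along dim0 -- "[a,b,c,d]":
flat = np.concatenate([beam1, beam2], axis=0)  # shape (4, 2)
```

The consumer evaluates all beams in one pass, so it wants a single dim0 of size 4, not two separate tensors of size 2.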
Or so I have understood this problem from reading into this issue. I don't have what it takes to fix it, but this seems to be the common consensus and the suggested solution.
This is something @kohya-ss would need to fix in the caption script… or possibly remove support for beam counts larger than 1 and force it to 1 all the time, since larger beam counts fail.
> The reason it works with a number of beams of 1 is that we give a matrix of size 2 and it expects size 2. But if you increase the number of beams, it expects a bigger matrix; we only have two smaller ones to offer, while it expects a single bigger one. So we need the two size-2 matrices to be one size-4 matrix.
But why is it squaring the dim0 value? That's the main visible problem. I don't know the code, so I have no idea what the tensor contains. What exactly is stored in dim0? For tensor A it's a value equal to the number of beams, and for tensor B it's a value equal to the square of the number of beams.
For the number of batches it does multiply correctly and it works (with 1 beam): 5 batches with 1 beam turns into a value of 5. But 3 batches with 2 beams turns into a value of 12.
If I use 64 beams I get: `RuntimeError: The size of tensor a (64) must match the size of tensor b (4096) at non-singleton dimension 0`
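The reported numbers fit a simple pattern. This is only a hypothesis inferred from the sizes quoted above, but tensor B's dim0 behaves like batch * num_beams², while the expected size is batch * num_beams:

```python
# Hypothesis based on the numbers reported in this thread, not on the code.
def observed_dim0(batch, num_beams):
    return batch * num_beams ** 2

def expected_dim0(batch, num_beams):
    return batch * num_beams

print(observed_dim0(5, 1), expected_dim0(5, 1))    # 5 5    -> sizes match, no error
print(observed_dim0(3, 2), expected_dim0(3, 2))    # 12 6   -> mismatch
print(observed_dim0(1, 64), expected_dim0(1, 64))  # 4096 64 -> the reported error
```

With 1 beam the square changes nothing, which would explain why only beam counts above 1 fail.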
I'm getting confused by the terminology. We call a 3D array a tensor, but a scalar, vector, and matrix are also tensors. Do these correspond to dim0, dim1, dim2, and dim3? Would that mean it's 4-dimensional, where the first one is the number of cubes? Or rather, is dim0 our beam, and dim1 to dim3 the beam's data? Or is it way more complex than this?
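For what it's worth, in NumPy/PyTorch usage "tensor" just means an n-dimensional array, and dim0 is simply its first axis, not a separate kind of object. A quick NumPy illustration (not specific to this code):

```python
import numpy as np

scalar = np.array(3.0)         # 0-D tensor, shape ()
vector = np.array([1.0, 2.0])  # 1-D tensor, shape (2,)
matrix = np.zeros((2, 3))      # 2-D tensor, shape (2, 3)
cube   = np.zeros((4, 2, 3))   # 3-D tensor, shape (4, 2, 3)

# dim0 is just the first axis of whatever tensor you have. For stacked
# image embeddings it would be the batch (or batch * num_beams) axis,
# and the remaining dims index into each item's data.
first_axis = cube.shape[0]     # 4
```

So a scalar is 0-dimensional, a vector 1-dimensional, and so on; dim0/dim1/dim2 are positions within one tensor's shape, not different tensor ranks.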
You can use a different number of beams; it's an issue with transformers. I forget which, but there's an older version of transformers you can downgrade to that fixes the BLIP beam issues.
> You can use a different number of beams; it's an issue with transformers. I forget which, but there's an older version of transformers you can downgrade to that fixes the BLIP beam issues.
Wouldn't that mean the issue is that the script, which imports and uses things from transformers, is no longer compatible with the new version and needs to be updated?
Could the issue be here?
blip.py, line 130
```python
def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10, top_p=0.9, repetition_penalty=1.0):
    image_embeds = self.visual_encoder(image)

    if not sample:
        image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)
```
Doesn't this duplicate the elements by the number of beams when it's not a sample, effectively making the number of elements squared? I tried just reversing the logic as a test so it gets bypassed, and now it at least doesn't error out and can do the caption. Not sure if there are side effects.
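That squaring would be consistent with a double expansion: the script repeats the embeddings by num_beams, and if a newer transformers also expands encoder states by num_beams inside `generate()` (my reading of the breaking change mentioned in this thread), dim0 ends up at batch * num_beams². Sketched with NumPy's `np.repeat`, which behaves like `repeat_interleave` along an axis:

```python
import numpy as np

batch, num_beams, dim = 3, 2, 4
image_embeds = np.zeros((batch, dim))

# Script-side expansion (the `if not sample:` branch in blip.py):
embeds = np.repeat(image_embeds, num_beams, axis=0)  # shape (6, 4)

# If generate() expands by num_beams again internally, dim0 becomes
# batch * num_beams**2, squared relative to the beam count.
double = np.repeat(embeds, num_beams, axis=0)        # shape (12, 4)
```

3 batches with 2 beams gives 12, and a single image with 64 beams would give 4096, matching the sizes in the errors reported above.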
Yeah, I found the side effect, and the reason for the issue. When you use beam search it should be a sample; when not using beam search it should not be a sample. Currently both are not samples.
This seems to be caused by a breaking change in transformers. I finally found how to fix this and updated the dev branch of sd-scripts. It will be merged into main soon.
https://github.com/kohya-ss/sd-scripts/commit/f1f30ab4188223081aa96329a75bc4a99672b411
Tensor B's size is the square of tensor A's size.