bmaltais / kohya_ss

Apache License 2.0

What does number of beams do in captioning, and how do I use it? I get an error on anything above 1. #2175

Closed TeKett closed 4 months ago

TeKett commented 5 months ago

Tensor B's size is the square of tensor A's size.

10:38:16-980477 INFO     Version: v22.3.1

10:38:16-987458 INFO     nVidia toolkit detected
10:38:18-560252 INFO     Torch 2.0.1+cu118
10:38:18-584160 INFO     Torch backend: nVidia CUDA 11.8 cuDNN 8700
10:38:18-586155 INFO     Torch detected GPU: NVIDIA GeForce RTX 4070 Ti VRAM 12281 Arch (8, 9) Cores 60
10:38:18-588150 INFO     Verifying modules installation status from requirements_windows_torch2.txt...
10:38:18-593163 INFO     Verifying modules installation status from requirements.txt...
10:38:21-600523 INFO     headless: False
10:38:21-607532 INFO     Load CSS...
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
10:51:20-341524 INFO     Captioning files in D:\Pictures\New folder (3)...
10:51:20-342520 INFO     ./venv/Scripts/python.exe "finetune/make_captions.py" --batch_size="1" --num_beams="2"
                         --top_p="0.9" --max_length="75" --min_length="35" --beam_search --caption_extension=".txt"
                         "D:\Pictures\New folder (3)"
                         --caption_weights="https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth"
Current Working Directory is:  C:\Train\Kohya
load images from D:\Pictures\New folder (3)
found 2 images.
loading BLIP caption: https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
BLIP loaded
  0%|                                                                                            | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "C:\Train\Kohya\finetune\make_captions.py", line 200, in <module>
    main(args)
  File "C:\Train\Kohya\finetune\make_captions.py", line 144, in main
    run_batch(b_imgs)
  File "C:\Train\Kohya\finetune\make_captions.py", line 97, in run_batch
    captions = model.generate(
  File "C:\Train\Kohya\finetune\blip\blip.py", line 158, in generate
    outputs = self.text_decoder.generate(input_ids=input_ids,
  File "C:\Train\Kohya\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Train\Kohya\venv\lib\site-packages\transformers\generation\utils.py", line 1611, in generate
    return self.beam_search(
  File "C:\Train\Kohya\venv\lib\site-packages\transformers\generation\utils.py", line 2909, in beam_search
    outputs = self(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 886, in forward
    outputs = self.bert(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 781, in forward
    encoder_outputs = self.encoder(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 445, in forward
    layer_outputs = layer_module(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 361, in forward
    cross_attention_outputs = self.crossattention(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 277, in forward
    self_outputs = self.self(
  File "C:\Train\Kohya\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Train\Kohya\finetune\blip\med.py", line 178, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 0
bmaltais commented 5 months ago

Not sure how it actually works, but this is an issue with the kohya python script. You should open an issue directly on his repository, sd-scripts.

5KilosOfCheese commented 5 months ago

Beam search is a branching method of searching where, at every step, you look at all the possibilities (in this case tokens). Instead of choosing just the one option with the highest score, you keep the k best possibilities and follow up on each of them. That k value is the beam width. So with a beam width of two you always consider the two highest-scoring options at each step. Say you have a sentence that starts with "This" and want to figure out the most likely next token: you check against all the tokens and get "girl", but it is possible that "girl" only scores highest due to model bias and isn't actually correct. With a beam width of 2 we can keep both "girl" and "potato", and "potato" turns out to be the correct one. And so on and so forth until you end up with "This potato smells odd" or whatever.
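
A toy sketch of that loop, with a made-up vocabulary and scoring function (not the BLIP/transformers implementation), just to make the "keep the k best" step concrete:

```python
# Toy beam search: at every step, keep only the beam_width highest-scoring hypotheses.
# VOCAB and score() are invented stand-ins for a real model's vocabulary and log-probs.
VOCAB = ["girl", "potato", "thing", "smells", "odd", "."]

def score(tokens, next_token):
    # Stand-in score for next_token given the tokens so far; a real model uses log-probabilities.
    return (hash((" ".join(tokens), next_token)) % 1000) / 1000.0

def beam_search(start_token, beam_width=2, steps=3):
    beams = [(0.0, [start_token])]                      # (cumulative score, hypothesis)
    for _ in range(steps):
        candidates = [
            (total + score(tokens, tok), tokens + [tok])
            for total, tokens in beams
            for tok in VOCAB
        ]
        # Keep the beam_width best hypotheses instead of only the single best one.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return [" ".join(tokens) for _, tokens in beams]

print(beam_search("This", beam_width=2))
```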

Number of beams is how many different theories we have about what the sentence might be. With 1 beam we have:

  1. This (girl, potato, thing...)

With 2 beams we can have

  1. This (girl, potato, thing...)
  2. This (boy, cheese, weather...)

But we evaluate both beams at the same time, because it would be pointless to do two evaluations of the same options; you'd just get two of the same path. So we end up trying to hand the system a size-2 matrix twice: ([a,b],[c,d]). But the system expects us to give it something like [a,b,c,d], because it evaluates the whole thing at once instead of each beam separately. The reason it works with 1 beam is that we give it a matrix of size 2 and it expects size 2. If you increase the number of beams it expects a bigger matrix, but we only have two smaller ones to offer, while it expects a single bigger one. So we need the two size-2 matrices to be one size-4 matrix.
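
You can reproduce the shape complaint with plain torch, using the same dim-0 sizes as the error above (only an illustration of the mismatch, not the actual BLIP call):

```python
import torch

seq_len, hidden = 5, 8

# Text/query side: batch_size * num_beams = 1 * 2 rows along dim 0.
query_layer = torch.randn(2, seq_len, hidden)
# Image/key side: ends up with 4 rows along dim 0 instead of the expected 2.
key_layer = torch.randn(4, seq_len, hidden)

try:
    torch.matmul(query_layer, key_layer.transpose(-1, -2))
except RuntimeError as e:
    print(e)   # The size of tensor a (2) must match the size of tensor b (4) at non-singleton dimension 0
```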

Or so I have understood this problem from reading into this issue. I don't have what it takes to fix it, but this seems to be the common consensus and suggested solution.

bmaltais commented 5 months ago

This is something @kohya-ss would need to fix in the caption script… or possibly remove support for beams larger than 1 and force it to 1 all the time, since larger beam counts fail.

TeKett commented 5 months ago

> The reason it works with 1 beam is that we give it a matrix of size 2 and it expects size 2. If you increase the number of beams it expects a bigger matrix, but we only have two smaller ones to offer, while it expects a single bigger one. So we need the two size-2 matrices to be one size-4 matrix.

But why is it squaring the dim-0 value? That's the main visible problem. I don't know the code, so I have no idea what the tensor contains. What exactly is stored in dim 0? For tensor A it's a value equal to the number of beams, and for tensor B it's a value equal to the square of the number of beams.

For the number of batches it does multiply, and that works (with 1 beam): 5 batches with 1 beam turns into a value of 5, while 3 batches with 2 beams turns into a value of 12.

If I use 64 beams I get: RuntimeError: The size of tensor a (64) must match the size of tensor b (4096) at non-singleton dimension 0

I'm getting confused over the terminology. We call a 3D array a tensor, but a scalar, vector, and matrix are also tensors. Do these correspond to dim0, dim1, dim2, and dim3? Would that mean it's 4-dimensional, where the first one is the number of cubes? Or rather that dim0 is our beam, and dim1-3 are the beam's data? Or is it way, way more complex than this?
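
On the terminology: the dim numbers are just the axes of a single tensor, not separate scalar/vector/matrix objects, and in this error dim 0 is the batch-times-beams axis while the remaining dims hold each beam's data, which is closest to the last guess above. A quick sketch of how torch counts them:

```python
import torch

scalar = torch.tensor(3.0)          # 0 dimensions
vector = torch.randn(4)             # 1 dimension
matrix = torch.randn(4, 8)          # 2 dimensions
stack  = torch.randn(2, 4, 8)       # 3 dimensions: dim 0 = 2, dim 1 = 4, dim 2 = 8

print(scalar.dim(), vector.dim(), matrix.dim(), stack.dim())   # 0 1 2 3

# In the traceback above, dim 0 of the attention tensors is batch_size * num_beams:
# each beam/image pair owns one slice along dim 0, and the other dims hold its data.
print(stack.shape[0])   # size along dim 0, the axis the error message is complaining about
```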

DarkViewAI commented 5 months ago

You can use a different number of beams; it's an issue with transformers. I forget which, but there's an older version of transformers you can downgrade to that fixes the BLIP beam issues.

TeKett commented 5 months ago

> You can use a different number of beams; it's an issue with transformers. I forget which, but there's an older version of transformers you can downgrade to that fixes the BLIP beam issues.

Wouldn't that mean the issue is that the script, which imports and uses things from transformers, is no longer compatible with the new version and needs to be updated?

Could the issue be here? blip.py, line 130

def generate(self, image, sample=False, num_beams=3, max_length=30, min_length=10, top_p=0.9, repetition_penalty=1.0):
    image_embeds = self.visual_encoder(image)

    if not sample:
        image_embeds = image_embeds.repeat_interleave(num_beams,dim=0)

Doesn't this duplicate the elements by the number of beams if it's not a sample? Effectively making the number of elements squared? I tried just reversing the logic as a test so it gets bypassed, and now it at least isn't erroring out and can do the caption. Not sure if there are side effects.
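
A shape-only sketch of what that would mean, assuming the newer transformers generate() also expands the encoder states by num_beams internally (that second expansion is only simulated here with another repeat_interleave):

```python
import torch

batch_size, num_beams = 1, 2
image_embeds = torch.randn(batch_size, 577, 1024)   # (batch, patches, hidden); sizes illustrative

# blip.py: manual expansion for beam search
image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)
print(image_embeds.shape[0])   # 2  -> batch_size * num_beams

# simulated second expansion inside generate(), which would explain the squaring
image_embeds = image_embeds.repeat_interleave(num_beams, dim=0)
print(image_embeds.shape[0])   # 4  -> batch_size * num_beams ** 2, the "tensor b (4)" side

# The text/query side is only expanded once (2 rows), hence 2 vs 4;
# the same arithmetic with 64 beams gives 64 vs 64 ** 2 == 4096.
```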

TeKett commented 5 months ago

Yeah, I found the side effect, and the reason for the issue. When you use beam search it should be treated as a sample, and when not using beam search it should not be. Currently both are treated as not being samples.
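
A sketch of the kind of guard that reasoning points at, assuming the newer transformers generate() expands the encoder states by itself during beam search; the `transformers_expands_for_beams` flag is hypothetical, and this is not necessarily what the upstream fix does:

```python
import torch

def expand_image_embeds(image_embeds: torch.Tensor, num_beams: int, sample: bool,
                        transformers_expands_for_beams: bool) -> torch.Tensor:
    """Give the image embeddings the right dim-0 size for the chosen decoding mode."""
    if sample:
        # Nucleus sampling path: one sequence per image, no expansion needed.
        return image_embeds
    if transformers_expands_for_beams:
        # Newer transformers repeats the encoder states by num_beams inside generate(),
        # so repeating here as well would square dim 0 (2 -> 4, 64 -> 4096).
        return image_embeds
    # Older transformers: the manual repeat from blip.py is still required for beam search.
    return image_embeds.repeat_interleave(num_beams, dim=0)
```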

kohya-ss commented 5 months ago

This seems to be caused by a breaking change in transformers. I finally found how to fix this and have updated the dev branch of sd-scripts. It will be merged into main soon.

https://github.com/kohya-ss/sd-scripts/commit/f1f30ab4188223081aa96329a75bc4a99672b411