d8ahazard / sd_smartprocess

Smart Pre-processing extension for Stable Diffusion

RuntimeError: The size of tensor a (8) must match the size of tensor b (64) at non-singleton dimension 0 #43


mhvelplund commented 1 year ago

With the following settings, on a folder containing 20 512x512 images, preprocessing fails with the error below. The GPU is an RTX 3080 with 16 GB of VRAM.

[Screenshot of the preprocessing settings used]
Loading captioning models...
Loading CLIP interrogator...
Loading CLIP model from ViT-H-14/laion2b_s32b_b79k
Loading BLIP model...
load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_large_caption.pth
Loading CLIP model...
Loaded CLIP model and data in 9.95 seconds.
Preprocessing...
  0%|                                                                                                                                                              | 0/21 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\extensions\sd_smartprocess\smartprocess.py", line 360, in preprocess
    full_caption = build_caption(img) if caption else None
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\extensions\sd_smartprocess\smartprocess.py", line 159, in build_caption
    tags = clip_interrogator.interrogate(img, max_flavors=clip_max_flavors)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\extensions\sd_smartprocess\clipinterrogator.py", line 193, in interrogate
    caption = self.generate_caption(image)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\extensions\sd_smartprocess\clipinterrogator.py", line 174, in generate_caption
    caption = self.blip_model.generate(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\blip.py", line 156, in generate
    outputs = self.text_decoder.generate(input_ids=input_ids,
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py", line 1604, in generate
    return self.beam_search(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\transformers\generation\utils.py", line 2902, in beam_search
    outputs = self(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\med.py", line 886, in forward
    outputs = self.bert(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\med.py", line 781, in forward
    encoder_outputs = self.encoder(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\med.py", line 445, in forward
    layer_outputs = layer_module(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\med.py", line 361, in forward
    cross_attention_outputs = self.crossattention(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\med.py", line 277, in forward
    self_outputs = self.self(
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\venv\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\mandr\Documents\Projects\sd\stable-diffusion-webui\repositories\BLIP\models\med.py", line 178, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (8) must match the size of tensor b (64) at non-singleton dimension 0
mhvelplund commented 1 year ago

After setting "Number of CLIP beams" to 1, the process ran correctly. I tried 2, but then it failed again.
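
For context, the beam count presumably flows straight through to BLIP's caption generation: with a single beam, transformers' `generate()` uses greedy decoding and never enters `beam_search()`, which is the frame in the traceback where the size mismatch is raised. A minimal sketch of that call (parameter names follow the public BLIP `generate()` API, not this extension's exact code):

```python
# Minimal sketch, not the extension's exact code: parameter names follow the
# public BLIP generate() API. With num_beams=1 transformers takes the greedy
# decoding path and never reaches beam_search(), where the traceback above fails.
caption = blip_model.generate(
    image_tensor,   # preprocessed image batch on the same device as the model
    sample=False,   # deterministic decoding
    num_beams=1,    # the "Number of CLIP beams" setting; values > 1 trigger the error
    max_length=48,  # illustrative length limits
    min_length=5,
)
```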

alinsavix commented 1 year ago

You can work around this in the very short term by changing your requirements.txt to use an older version of transformers (transformers==4.26.1, to be specific). That will probably break other things, though. This seems to affect every package that does CLIP interrogation... I have no idea what a correct fix would take, sadly. :(
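
For anyone trying that workaround, it is a one-line pin plus a reinstall inside the webui's venv; which requirements.txt to edit depends on your install, so treat the location as an assumption:

```
# in the relevant requirements.txt (exact path depends on your install)
transformers==4.26.1
```

Then, with the venv activated, run `pip install transformers==4.26.1` (or `pip install -r requirements.txt`) to apply the downgrade.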