ZFTurbo / Music-Source-Separation-Training

Repository for training models for music source separation.

Issue with inference on swin_upernet models: ValueError: Make sure that the channel dimension of the pixel values match with the one set in the configuration. #6

Closed wesleyr36 closed 7 months ago

wesleyr36 commented 7 months ago

I was trying to test out the pre-trained swin_upernet model you provided but encountered the following error:

Traceback (most recent call last):
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\inference.py", line 99, in <module>
    proc_folder(None)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\inference.py", line 95, in proc_folder
    run_folder(model, args, config, device, verbose=False)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\inference.py", line 44, in run_folder
    res = demix_track(config, model, mixture, device)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\utils.py", line 62, in demix_track
    x = model(part.unsqueeze(0))[0]
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Music-Source-Separation-Training-main\Music-Source-Separation-Training-main\models\upernet_swin_transformers.py", line 201, in forward
    x = self.swin_upernet_model(x).logits
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\upernet\modeling_upernet.py", line 406, in forward
    outputs = self.backbone.forward_with_filtered_kwargs(
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\utils\backbone_utils.py", line 210, in forward_with_filtered_kwargs
    return self(*args, **filtered_kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\swin\modeling_swin.py", line 1313, in forward
    embedding_output, input_dimensions = self.embeddings(pixel_values)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\swin\modeling_swin.py", line 263, in forward
    embeddings, output_dimensions = self.patch_embeddings(pixel_values)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\wesle\AppData\Roaming\Python\Python39\site-packages\transformers\models\swin\modeling_swin.py", line 315, in forward
    raise ValueError(
ValueError: Make sure that the channel dimension of the pixel values match with the one set in the configuration.

I've made no changes to the configs and have tried updating my packages, but no luck.

ZFTurbo commented 7 months ago

I can't reproduce your error. Please check which versions you have:

torch>=2.0.1
transformers==4.35.0
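
(For reference, a quick way to print the installed versions from Python; just a minimal check, nothing repository-specific:)

    import torch
    import transformers

    # Compare these against the required versions above.
    print('torch:', torch.__version__)
    print('transformers:', transformers.__version__)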
jarredou commented 7 months ago

I was facing the same problem yesterday when trying to run it on Colab. I haven't checked carefully, but the requirements were installed without errors, so I guess it was OK.

Is it normal that in the class it tries to load: UperNetForSemanticSegmentation.from_pretrained("openmmlab/upernet-swin-large")

but in main it tries to load: UperNetForSemanticSegmentation.from_pretrained("./results/")?

If I make it point to "./results", it asks for a config.json file that is not present.

ZFTurbo commented 7 months ago

> I was facing the same problem yesterday when trying to run it on Colab. I haven't checked carefully, but the requirements were installed without errors, so I guess it was OK.
>
> Is it normal that in the class it tries to load: UperNetForSemanticSegmentation.from_pretrained("openmmlab/upernet-swin-large")
>
> but in main it tries to load: UperNetForSemanticSegmentation.from_pretrained("./results/")?
>
> If I make it point to "./results", it asks for a config.json file that is not present.

If you mean this line: https://github.com/ZFTurbo/Music-Source-Separation-Training/blob/main/models/upernet_swin_transformers.py#L220

I used it only for debugging purposes. It isn't used during an inference run.

jarredou commented 7 months ago

OK, I was just wondering why one was pointing to Hugging Face and the other to local files.

ZFTurbo commented 7 months ago

@jarredou were you able to run this model?

jarredou commented 7 months ago

No, I had the exact same error reported in this issue and gave up. I'm currently training an mdx23c model with my Colab account, so I can't do further testing.

ZFTurbo commented 7 months ago

Ah yes! I've just remembered that I made changes to the transformers code. You need to change the function in site-packages\transformers\models\swin\modeling_swin.py at line 312:

    def forward(self, pixel_values: Optional[torch.FloatTensor]) -> Tuple[torch.Tensor, Tuple[int]]:
        _, num_channels, height, width = pixel_values.shape
        if num_channels != self.num_channels:
            # Hardcoded! Accept whatever channel count is provided instead of raising.
            print('Old num_channels: {} New num_channels: {}'.format(self.num_channels, num_channels))
            self.num_channels = num_channels
            if 0:
                # Original transformers check, disabled on purpose.
                raise ValueError(
                    "Make sure that the channel dimension of the pixel values match with the one set in the configuration."
                )
        # pad the input to be divisible by self.patch_size, if needed
        pixel_values = self.maybe_pad(pixel_values, height, width)
        embeddings = self.projection(pixel_values)
        _, _, height, width = embeddings.shape
        output_dimensions = (height, width)
        embeddings = embeddings.flatten(2).transpose(1, 2)

        return embeddings, output_dimensions

I didn't find a workaround that doesn't require changing the transformers code.
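
(For anyone who would rather not edit site-packages directly, one possible alternative is to monkey-patch the check at runtime from the calling code before the model is built. This is only a sketch, untested here, assuming transformers 4.35.0, where SwinPatchEmbeddings.forward performs this check; the helper names _original_forward and _patched_forward are just for illustration:)

    # Sketch of a runtime monkey-patch: relax the channel check in
    # SwinPatchEmbeddings.forward instead of editing modeling_swin.py.
    # Apply this before the swin_upernet model is created/run.
    from transformers.models.swin import modeling_swin

    _original_forward = modeling_swin.SwinPatchEmbeddings.forward

    def _patched_forward(self, pixel_values):
        # Adopt the incoming channel count so the original check passes.
        num_channels = pixel_values.shape[1]
        if num_channels != self.num_channels:
            print('Old num_channels: {} New num_channels: {}'.format(self.num_channels, num_channels))
            self.num_channels = num_channels
        return _original_forward(self, pixel_values)

    modeling_swin.SwinPatchEmbeddings.forward = _patched_forward

This keeps the installed package untouched, so the change survives a reinstall or a fresh Colab session, but it relies on the same idea as the edit above: simply skipping the channel-dimension check.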