JuanFMontesinos / VoViT

VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer
https://ipcv.github.io/VoViT/

Dimensional Error on Forward Pass #8

Open SarrocaGSergi opened 1 year ago

SarrocaGSergi commented 1 year ago

After adjusting the code to make it run, I hit a dimension mismatch error in the forward pass:

Creating model instance...
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/lazy.py:180: UserWarning: Lazy modules are a new feature under heavy development so changes to the API or functionality can happen at any moment.
  warnings.warn('Lazy modules are a new feature under heavy development '
VoViT pre-trained weights loaded
Lead Voice enhancer pre-trained weights loaded
Done
Forwarding speaker1...
/usr/local/lib/python3.10/dist-packages/torchaudio/functional/functional.py:109: UserWarning: `return_complex` argument is now deprecated and is not effective.`torchaudio.functional.spectrogram(power=None)` always returns a tensor with complex dtype. Please remove the argument in the function call.
  warnings.warn(
/content/VoViT/vovit/core/models/production_model.py:102: UserWarning: Casting complex values to real discards the imaginary part (Triggered internally at ../aten/src/ATen/native/Copy.cpp:276.)
  return s.to(dtype)
---------------------------------------------------------------------------
EinopsError                               Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/einops/einops.py in reduce(tensor, pattern, reduction, **axes_lengths)
    411         recipe = _prepare_transformation_recipe(pattern, reduction, axes_lengths=hashable_axes_lengths)
--> 412         return _apply_recipe(recipe, tensor, reduction_type=reduction)
    413     except EinopsError as e:

15 frames
/usr/local/lib/python3.10/dist-packages/einops/einops.py in _apply_recipe(recipe, tensor, reduction_type)
    234     init_shapes, reduced_axes, axes_reordering, added_axes, final_shapes = \
--> 235         _reconstruct_from_shape(recipe, backend.shape(tensor))
    236     tensor = backend.reshape(tensor, init_shapes)

/usr/local/lib/python3.10/dist-packages/einops/einops.py in _reconstruct_from_shape_uncached(self, shape)
    164         if len(shape) != len(self.input_composite_axes):
--> 165             raise EinopsError('Expected {} dimensions, got {}'.format(len(self.input_composite_axes), len(shape)))
    166 

EinopsError: Expected 4 dimensions, got 3

During handling of the above exception, another exception occurred:

EinopsError                               Traceback (most recent call last)
<ipython-input-7-faaa648e3dcd> in <cell line: 28>()
     28 with torch.no_grad():
     29     print('Forwarding speaker1...')
---> 30     pred_s1 = model.forward_unlimited(mixture, speaker1_face)
     31     print('Forwarding speaker2...')
     32     pred_s2 = model.forward_unlimited(mixture, speaker2_face)

/content/VoViT/vovit/__init__.py in forward_unlimited(self, mixture, visuals)
     78         visuals = visuals[:n_chunks * fps * 2].view(n_chunks, fps * 2, 3, 68)
     79         mixture = mixture[:n_chunks * length].view(n_chunks, -1)
---> 80         pred = self.forward(mixture, visuals)
     81         pred_unraveled = {}
     82         for k, v in pred.items():

/content/VoViT/vovit/__init__.py in forward(self, mixture, visuals, extract_landmarks)
     56         mixture /= mixture.abs().max()
     57 
---> 58         return self.vovit(mixture, ld)
     59 
     60     def forward_unlimited(self, mixture, visuals):

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

/content/VoViT/vovit/core/models/production_model.py in forward(self, mixture, landmarks)
    378         """
    379         inputs = {'src': mixture, 'landmarks': landmarks}
--> 380         return self.avse(inputs)

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

/content/VoViT/vovit/core/models/production_model.py in forward(self, *args, **kwargs)
    325 
    326     def forward(self, *args, **kwargs):
--> 327         return self.inference(*args, **kwargs)
    328 
    329     def inference(self, inputs: dict, n_iter=1):

/content/VoViT/vovit/core/models/production_model.py in inference(self, inputs, n_iter)
    329     def inference(self, inputs: dict, n_iter=1):
    330         with torch.no_grad():
--> 331             output = self.forward_avse(inputs, compute_istft=False)
    332             estimated_sp = output['estimated_sp']
    333             for i in range(n_iter):

/content/VoViT/vovit/core/models/production_model.py in forward_avse(self, inputs, compute_istft)
    321     def forward_avse(self, inputs, compute_istft: bool):
    322         self.av_se.eval()
--> 323         output = self.av_se(inputs, compute_wav=compute_istft)
    324         return output
    325 

/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
   1499                 or _global_backward_pre_hooks or _global_backward_hooks
   1500                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501             return forward_call(*args, **kwargs)
   1502         # Do not call functions when jit is used
   1503         full_backward_hooks, non_full_backward_hooks = [], []

/content/VoViT/vovit/core/models/production_model.py in forward(self, inputs, compute_wav)
    223         # ==========================================
    224 
--> 225         audio_feats = self.audio_processor.preprocess_audio(inputs['src'])
    226 
    227         """

/content/VoViT/vovit/core/models/production_model.py in preprocess_audio(self, n_sources, *src)
    135             # Contiguous required to address memory problems in certain gpus
    136             sp_mix = sp_mix_raw[:, ::2, ...].contiguous()  # BxFxTx2
--> 137         x = rearrange(sp_mix, 'b f t c -> b c f t')
    138         output = {'mixture': x, 'sp_mix_raw': sp_mix_raw}
    139 

/usr/local/lib/python3.10/dist-packages/einops/einops.py in rearrange(tensor, pattern, **axes_lengths)
    481             raise TypeError("Rearrange can't be applied to an empty list")
    482         tensor = get_backend(tensor[0]).stack_on_zeroth_dimension(tensor)
--> 483     return reduce(cast(Tensor, tensor), pattern, reduction='rearrange', **axes_lengths)
    484 
    485 

/usr/local/lib/python3.10/dist-packages/einops/einops.py in reduce(tensor, pattern, reduction, **axes_lengths)
    418             message += '\n Input is list. '
    419         message += 'Additional info: {}.'.format(axes_lengths)
--> 420         raise EinopsError(message + '\n {}'.format(e))
    421 
    422 

EinopsError:  Error while processing rearrange-reduction pattern "b f t c -> b c f t".
 Input tensor shape: torch.Size([4, 256, 128]). Additional info: {}.
 Expected 4 dimensions, got 3

This error occurs both in the Colab notebook and when I clone the repo locally. The main changes I made were updating the requirements to newer versions of the PyTorch packages and CUDA, and fixing the bugs caused by the removed np.int alias. What do you suggest?
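For reference, the np.int fixes were of this kind (a minimal sketch; the call sites below are hypothetical examples, not actual VoViT code):

import numpy as np

duration, fps = 10.7, 25                   # illustrative values
# np.int was deprecated in NumPy 1.20 and removed in 1.24; the builtin
# int (or an explicit np.int64) is a drop-in replacement.
n_chunks = int(duration * fps)             # instead of np.int(duration * fps)
idx = np.arange(n_chunks, dtype=np.int64)  # instead of dtype=np.int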

JuanFMontesinos commented 1 year ago

Hi, it's hard to know without looking at the changes you applied. I just know it doesn't work with newer PyTorch versions, but I forgot why. Could you point me to a public repo so I can inspect the commits and the changed code?

ioyy900205 commented 1 year ago
def wav2sp(self, x):
    # CUDNN does not support half-precision complex numbers for
    # non-power-of-2 windows; casting to float32 is a workaround.
    dtype = x.dtype
    x = x.float()
    # In recent torchaudio, spectrogram(power=None) always returns a
    # complex tensor and return_complex is deprecated, so the argument
    # is dropped; view_as_real restores the trailing real/imag axis
    # (BxFxTx2) that the downstream rearrange expects.
    s = spectrogram(x, pad=0, window=self._window.float(), win_length=self._n_fft,
                    n_fft=self._n_fft, hop_length=self._hop_length,
                    power=None, normalized=False)
    return torch.view_as_real(s).to(dtype)  # restore the original cast to the input dtype

will work. The main reason is the torch version difference: with power=None, newer torchaudio always returns a complex spectrogram, so it has to be converted back to a real tensor with a trailing real/imag axis explicitly.
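A minimal standalone check of that behavior (the STFT parameters are illustrative, not VoViT's actual configuration):

import torch
from torchaudio.functional import spectrogram
from einops import rearrange

# With power=None, recent torchaudio returns a complex tensor (3-D: B x F x T);
# view_as_real appends a size-2 real/imag axis, restoring the B x F x T x 2
# layout that the rearrange in preprocess_audio expects.
x = torch.randn(4, 16384)
window = torch.hann_window(1022)
s = spectrogram(x, pad=0, window=window, win_length=1022, n_fft=1022,
                hop_length=256, power=None, normalized=False)
print(s.dtype)                                    # torch.complex64
sp = torch.view_as_real(s)                        # B x F x T x 2
print(rearrange(sp, 'b f t c -> b c f t').shape)  # torch.Size([4, 2, 512, 65])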