Open zuujhyt opened 2 years ago
Hi,
`source['audio']` is the log filterbank, see here. It should also be normalized (like here) when `task.cfg.normalize=True`, which I believe is the case for all the models we release. Besides, `source['audio']` should have the same sequence length as `source['video']` before being fed into the model, as we assume the audio and video are synchronized.
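A minimal sketch of the normalization and length alignment described above, assuming the log filterbank features have already been computed (e.g. with `python_speech_features.logfbank`). The per-utterance zero-mean/unit-variance normalization and the truncate-or-pad alignment strategy are assumptions for illustration, not the repo's exact preprocessing:

```python
import numpy as np

def normalize(feat):
    # Hypothetical per-utterance normalization (zero mean, unit variance
    # per filterbank channel), applied when task.cfg.normalize=True.
    return (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-8)

def align_to_video(audio_feat, num_video_frames):
    # Truncate or zero-pad the audio features along the time axis so the
    # sequence length matches the number of video frames.
    t, f = audio_feat.shape
    if t >= num_video_frames:
        return audio_feat[:num_video_frames]
    pad = np.zeros((num_video_frames - t, f), dtype=audio_feat.dtype)
    return np.concatenate([audio_feat, pad], axis=0)

audio_feat = np.random.randn(96, 26).astype(np.float32)  # stand-in for logfbank output
audio_feat = normalize(audio_feat)
audio_feat = align_to_video(audio_feat, 96)
print(audio_feat.shape)  # (96, 26)
```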
Hello,
I am using your script to extract audio-visual features. After extracting the log filterbank with `python_speech_features` (shape `(96, 26)`, with frames of shape `(96, 88, 88)`), I get the following error from `hubert.py`, line 327, in `forward`:

```
numpy.AxisError: axis 2 is out of bounds for array of dimension 2
```

when using the following command:
`feature, _ = model.extract_finetune(source={'video': frames, 'audio': audio_feat}, padding_mask=None, output_layer=None)`
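The `AxisError` suggests the model indexed axis 2 of a 2-D array, i.e. it likely expects a batched 3-D audio input rather than the raw `(T, F)` feature matrix. A hypothetical fix (the exact expected layout is an assumption; the repo's code may also require converting to torch tensors, e.g. `torch.from_numpy(...).unsqueeze(0)`):

```python
import numpy as np

# Stand-ins for the shapes reported above.
audio_feat = np.random.randn(96, 26).astype(np.float32)    # (T, F), 2-D
frames = np.random.randn(96, 88, 88).astype(np.float32)    # (T, H, W)

# Add a leading batch dimension so axis 2 exists on the audio input.
audio_feat = np.expand_dims(audio_feat, 0)  # (1, 96, 26)
frames = np.expand_dims(frames, 0)          # (1, 96, 88, 88)
print(audio_feat.ndim, frames.ndim)
```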
Hi, thank you for the work and the colab. In the colab, the following code snippet shows how to extract visual features.
I wonder how I can extract audio-visual features. Can you please give an example, or specifically say what to feed into `source['audio']`? Is it a waveform normalized to [-1, 1], or some other spectral features? Thank you.