shakeel608 opened 1 year ago
Hi,
Are these the learned representations of AV-HuBERT, or just features extracted from the input video file that still need to be passed to the AV-HuBERT model?
If these are learned representations, are they a single common representation shared across the audio and visual modalities?
The example in the colab illustrates how to extract visual features, which are the outputs of the last layer of our model given the video input. In this example, the audio input is None,
so it covers only the visual modality.
And how do I extract learned features from a specific layer of the AV-HuBERT model?
You can change the value of the output_layer
argument in this function.
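For illustration, a minimal sketch of how that call might look, assuming `model` is an already-loaded AV-HuBERT model and `frames` is a preprocessed lip-ROI tensor of shape (1, 1, T, 88, 88); the layer index 9 is just an arbitrary example, and it is worth checking in the code whether the index is 0- or 1-based:

```python
import torch

with torch.no_grad():
    # output_layer=None returns features from the last transformer layer;
    # an integer should select an intermediate layer instead (index base to be verified).
    feature, _ = model.extract_finetune(
        source={'video': frames, 'audio': None},
        padding_mask=None,
        output_layer=9,  # example value only, not a recommendation
    )

# e.g. this thread reports torch.Size([17, 1024]) for a 17-frame clip
print(feature.shape)
```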
@chevalierNoir
Thank you for your answer. I am using your script to extract audio-visual features.
After extracting log filterbanks with python_speech_features, the audio features have shape (96, 26) and the video frames have shape (96, 88, 88), so the lengths are already synchronized at 96 in each modality.
It throws the following error from hubert.py, line 327, in forward:

x = self.proj(x.transpose(1, 2))
numpy.AxisError: axis 2 is out of bounds for array of dimension 2

when using the following command:

feature, _ = model.extract_finetune(source={'video': frames, 'audio': audio_feat}, padding_mask=None, output_layer=None)

If I keep audio: None, it works perfectly fine; the problem only appears when I want to extract features for both modalities.
Also, the input dimension of the audio encoder is 104 in the AV-HuBERT model:

RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 96] but got: [1, 104]
AVHubertModel( (feature_extractor_audio): SubModel( (proj): Linear(in_features=104, out_features=768, bias=True) )
....

How did you arrive at this 104-dimensional audio input? Does it mean every audio feature must be forced to dimension 104 in order to be processed by AV-HuBERT for embedding extraction?
Although it may be a bit late: that 104 refers to the number of features in the audio input tensor. To make the model work with audio stream data, you have to stack every 4 frames into 1 in order to better align the acoustic cues with the visual ones. This means adding:

stack_order_audio = 4
audio_feats = stacker(audio_feats, stack_order_audio)

to your implementation when processing the audio waveform. This way you obtain an acoustic input tensor of shape (time, 104); then you only need to add a fake batch dimension of 1 and permute it to obtain the shape (batch, 104, time), as follows:

audio_feats.unsqueeze(dim=0).permute(0, 2, 1).cuda()
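For reference, here is a minimal sketch of what such a stacker can look like (the official dataset code in hubert_dataset.py has its own version; this one only illustrates the idea of zero-padding to a multiple of stack_order and reshaping 26-dim filterbanks into 4 * 26 = 104-dim frames):

```python
import numpy as np

def stacker(feats: np.ndarray, stack_order: int) -> np.ndarray:
    """Stack every `stack_order` consecutive audio frames into one.

    feats: array of shape (T, F), e.g. (T, 26) log filterbanks.
    Returns an array of shape (ceil(T / stack_order), F * stack_order),
    e.g. (T', 104) for stack_order=4 and 26-dim filterbanks.
    """
    feat_dim = feats.shape[1]
    remainder = len(feats) % stack_order
    if remainder != 0:
        # zero-pad so the number of frames is a multiple of stack_order
        pad = np.zeros([stack_order - remainder, feat_dim], dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    # row-major reshape concatenates each group of stack_order frames
    return feats.reshape(-1, stack_order * feat_dim)

# e.g. (96, 26) log filterbanks -> (24, 104) stacked features
# audio_feats = stacker(audio_feats, stack_order_audio)
```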
I think the confusion comes from
https://github.com/facebookresearch/av_hubert/blob/main/avhubert/hubert_dataset.py#L287C13-L287C71
where the self.stack_order_audio
variable is set to 1 by default, but it is certainly overridden in one of the configuration files.
I hope this solves your issue :) Best regards.
audio_feats = logfbank(wav_data, samplerate=sr).astype(np.float32)

gives audio_feats.shape = (93, 26). Is 26 here the number of temporal frames, and 93 the filterbanks?
After I apply

audio_feats = self.stacker(audio_feats, stack_order_audio=4)

I get audio_feats of shape (24, 104).
If I extract only visual features with

feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)

from frames of shape torch.Size([1, 1, 17, 88, 88]), I get an output of shape torch.Size([17, 1024]),
but when I pass both the audio and video modalities as

feature, _ = self.model.extract_finetune(source={'video': frames, 'audio': audio_feats}, padding_mask=None, output_layer=None)

I get the same error:

x = self.proj(x.transpose(1, 2))
numpy.AxisError: axis 2 is out of bounds for array of dimension 2
One more question: if I want to pass only audio features, what should their input format be?
I tried passing audio_feats only and keeping 'video': None, and it throws the same error as above.
Can you please help with this?
I have figured it out while extracting features separately for audio and video.
Visual frames: torch.Size([1, 1, L, X, Y])
audio_feats should be (batch, 104, L) when using stack_order_audio=4.
Since my video and audio have different lengths (the visual frames have length L1 = 17 and the audio frames length L2), I can extract either audio or visual features separately, but because of the length mismatch I am not able to extract joint features.
How do I synchronize the lengths of the audio and visual features so that both have exactly the same L?
And how do I specify a particular layer for feature extraction in

feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)

What should be passed to output_layer to get features from a specific layer?
Digging into my scripts, I found that I used this padding trick:
diff = len(audio_feats) - len(video_frames)
if diff < 0:
    # audio is shorter than the video: zero-pad the audio features
    audio_feats = np.concatenate([
        audio_feats,
        np.zeros([-diff, audio_feats.shape[-1]], dtype=audio_feats.dtype),
    ])
elif diff > 0:
    # audio is longer than the video: drop the trailing audio frames
    audio_feats = audio_feats[:-diff]
The idea is that, given the number of frames of my video clip, I compute the difference w.r.t. my 4-frame-stacked audio features. If the audio features are shorter than the video, I add some zero padding by concatenation. Conversely, if the audio features are longer, I discard the last frames, hoping this does not significantly affect the quality of the extracted embeddings. I have to say that, even with this padding trick, the quality of my experiments was quite good. And if there is no difference, everything is perfect :)
I do not know if this code is mine or if I found it in the AV-HuBERT official implementation. I hope it helps you!
Warning: take into account the frame rate of your video clips. If it is 50 fps, you should stack every 2 frames of the audio feature sequence. Why? Audio features are typically extracted at 100 fps, so a stacking factor of 100 fps / 50 fps = 2 should be enough to align both cues (although sometimes we need to zero-pad, as discussed above). The reason AV-HuBERT stacks 4 frames is that videos are usually recorded at 25 fps, as is the case for LRS3.
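In other words, the stacking factor is just the ratio between the audio feature rate and the video frame rate. A tiny sketch of that arithmetic, assuming the usual 25 ms / 10 ms filterbank setup:

```python
# audio features at ~100 fps (25 ms windows, 10 ms hop) vs. video at video_fps
audio_feature_fps = 100
video_fps = 25                                            # 25 for LRS3-style data; 50 fps video would give 2
stack_order_audio = round(audio_feature_fps / video_fps)  # -> 4
```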
Best regards, David.
Thank you so much @david-gimeno, it works. The dataset I am currently working with also has a 25 fps frame rate.
Regarding the other question:
How do I specify a particular layer for feature extraction in

feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)

What should be passed to output_layer to get features from a specific layer? I tried passing arbitrary numbers and it accepted all of them. I just want to confirm what we should pass to output_layer
so that I extract embeddings from a specific layer.
I am glad it worked :) However, I've never tried to extract features from a specific intermediate layer of AV-HuBERT, so in this case I cannot help you. I guess you should open another issue or inspect the forward method of the architecture and then modify or find out how to return what you are looking for.
Best regards, David.
Thank you so much, I will open a new issue.
Thank you so much @david-gimeno, it works.
My pleasure, let me know if you need anything else.
@david-gimeno Just a quick question
What do you mean by "audio is typically extracted at 100 fps"?
In the field of ASR, the raw audio waveform is processed to extract the well-established Mel Frequency Cepstral Coefficients (MFCCs). I recommend reading more about how these audio features are computed. In general terms, and regarding the sample rate, we process the entire waveform with an overlapping sliding window. One of the most prevalent setups uses windows of 25 ms with a step of 10 ms, resulting in features extracted at approximately 100 fps. This processing method is pretty much a standard in the field. However, as expected, new methods and new feature extraction techniques have been explored. This paper explains the conventional HMM-based systems, if you want to learn more about the origins of ASR.
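As a quick sanity check of the ~100 fps figure, here is a small sketch using python_speech_features (as in the snippets above); it assumes 16 kHz audio and uses dummy noise just to show the output shape:

```python
import numpy as np
from python_speech_features import logfbank

sr = 16000
wav_data = np.random.randn(sr).astype(np.float32)  # 1 second of dummy 16 kHz audio

# 25 ms windows with a 10 ms step -> roughly 100 feature frames per second
audio_feats = logfbank(wav_data, samplerate=sr, winlen=0.025, winstep=0.01, nfilt=26)
print(audio_feats.shape)  # approximately (100, 26)
```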
Is the visual feature extracted by AV-HuBERT a generic video feature? Can I use AV-HuBERT pre-trained on English to extract features from Chinese videos? @chevalierNoir