shakeel608 opened 1 year ago
Hi,
Are these the learned representations of AV-HuBERT, or just features extracted from the input video file that still need to be passed to the AV-HuBERT model?
If these are learned representations, are they a single common representation shared across the audio and visual modalities?
The example in the colab illustrates how to extract visual features, which are the outputs of the last layer of our model given the video input. In this example, the audio input is None,
so it covers only the visual modality.
And how do I extract learned features from a specific layer of the AV-HuBERT model?
You can change the value of the output_layer
argument in this function.
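For illustration, a minimal sketch of how that call might look, assuming `model` is an already-loaded AV-HuBERT model and `frames` is a preprocessed lip-ROI tensor of shape (1, 1, T, 88, 88); the layer index 9 is just an arbitrary example, and it is worth checking in the code whether the index is 0- or 1-based:

```python
import torch

with torch.no_grad():
    # output_layer=None returns features from the last transformer layer;
    # an integer should select an intermediate layer instead (index base to be verified).
    feature, _ = model.extract_finetune(
        source={'video': frames, 'audio': None},
        padding_mask=None,
        output_layer=9,  # example value only, not a recommendation
    )

# e.g. this thread reports torch.Size([17, 1024]) for a 17-frame clip
print(feature.shape)
```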
@chevalierNoir
Thank you for your answer. I am using your script to extract audio-visual features.
After extracting log filterbanks with python_speech_features, the audio features have shape (96, 26) and the video frames have shape (96, 88, 88), so the lengths are already synchronized at 96 in each modality.
It throws the following error from hubert.py, line 327, in forward:

x = self.proj(x.transpose(1, 2))
numpy.AxisError: axis 2 is out of bounds for array of dimension 2

when using the following command:

feature, _ = model.extract_finetune(source={'video': frames, 'audio': audio_feat}, padding_mask=None, output_layer=None)

If I keep audio: None, it works perfectly fine; the problem only appears when I want to extract features for both modalities.
Also, the input dimension of the audio encoder is 104 in the AV-HuBERT model:

RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 96] but got: [1, 104]
AVHubertModel( (feature_extractor_audio): SubModel( (proj): Linear(in_features=104, out_features=768, bias=True) )
....

How did you arrive at this 104-dimensional audio input? Does it mean every audio feature must be forced to dimension 104 in order to be processed by AV-HuBERT for embedding extraction?
Although it may be a bit late: that 104 refers to the number of features in the audio input tensor. To make the model work with audio stream data, you have to stack every 4 frames into 1 in order to better align the acoustic cues with the visual ones. This means adding:

stack_order_audio = 4
audio_feats = stacker(audio_feats, stack_order_audio)

to your implementation when processing the audio waveform. This way you obtain an acoustic input tensor of shape (time, 104); then you only need to add a fake batch dimension of 1 and permute it to obtain the shape (batch, 104, time), as follows:

audio_feats.unsqueeze(dim=0).permute(0, 2, 1).cuda()
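For reference, here is a minimal sketch of what such a stacker can look like (the official dataset code in hubert_dataset.py has its own version; this one only illustrates the idea of zero-padding to a multiple of stack_order and reshaping 26-dim filterbanks into 4 * 26 = 104-dim frames):

```python
import numpy as np

def stacker(feats: np.ndarray, stack_order: int) -> np.ndarray:
    """Stack every `stack_order` consecutive audio frames into one.

    feats: array of shape (T, F), e.g. (T, 26) log filterbanks.
    Returns an array of shape (ceil(T / stack_order), F * stack_order),
    e.g. (T', 104) for stack_order=4 and 26-dim filterbanks.
    """
    feat_dim = feats.shape[1]
    remainder = len(feats) % stack_order
    if remainder != 0:
        # zero-pad so the number of frames is a multiple of stack_order
        pad = np.zeros([stack_order - remainder, feat_dim], dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    # row-major reshape concatenates each group of stack_order frames
    return feats.reshape(-1, stack_order * feat_dim)

# e.g. (96, 26) log filterbanks -> (24, 104) stacked features
# audio_feats = stacker(audio_feats, stack_order_audio)
```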
I think the confusion comes from
https://github.com/facebookresearch/av_hubert/blob/main/avhubert/hubert_dataset.py#L287C13-L287C71
where the self.stack_order_audio
variable is set to 1 by default, but it is certainly overridden in one of the configuration files.
I hope this solves your issue :) Best regards.
audio_feats = logfbank(wav_data, samplerate=sr).astype(np.float32)

gives audio_feats.shape = (93, 26). Is 26 here the number of temporal frames, and 93 the filterbanks?
After I apply

audio_feats = self.stacker(audio_feats, stack_order_audio=4)

I get audio_feats of shape (24, 104).
If I extract only visual features with

feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)

from frames of shape torch.Size([1, 1, 17, 88, 88]), I get an output of shape torch.Size([17, 1024]),
but when I pass both the audio and video modalities as

feature, _ = self.model.extract_finetune(source={'video': frames, 'audio': audio_feats}, padding_mask=None, output_layer=None)

I get the same error:

x = self.proj(x.transpose(1, 2))
numpy.AxisError: axis 2 is out of bounds for array of dimension 2
One more question: if I want to pass only audio features, what should their input format be?
I tried passing audio_feats only and keeping 'video': None, and it throws the same error as above.
Can you please help with this?
I have figured it out while extracting features separately for audio and video.
Visual frames: torch.Size([1, 1, L, X, Y])
audio_feats should be (batch, 104, L) when using stack_order_audio=4.
Since my video and audio have different lengths (the visual frames have length L1 = 17 and the audio frames length L2), I can extract either audio or visual features separately, but because of the length mismatch I am not able to extract joint features.
How do I synchronize the lengths of the audio and visual features so that both have exactly the same L?
And how do I specify a particular layer for feature extraction in

feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)

What should be passed to output_layer to get features from a specific layer?
Digging into my scripts, I found that I used this padding trick:
diff = len(audio_feats) - len(video_frames)
if diff < 0:
    # audio is shorter than the video: zero-pad the audio features
    audio_feats = np.concatenate([
        audio_feats,
        np.zeros([-diff, audio_feats.shape[-1]], dtype=audio_feats.dtype),
    ])
elif diff > 0:
    # audio is longer than the video: drop the trailing audio frames
    audio_feats = audio_feats[:-diff]
The idea is that, given the number of frames of my video clip, I compute the difference w.r.t. my 4-frame-stacked audio features. If the audio features are shorter than the video, I add some zero padding by concatenation. Conversely, if the audio features are longer, I discard the last frames, hoping this does not significantly affect the quality of the extracted embeddings. I have to say that, even with this padding trick, the quality of my experiments was quite good. And if there is no difference, everything is perfect :)
I do not know if this code is mine or if I found it in the AV-HuBERT official implementation. I hope it helps you!
Warning: take into account the frame rate of your video clips. If it is 50 fps, you should stack every 2 frames of the audio feature sequence. Why? Audio features are typically extracted at 100 fps, so a stacking factor of 100 fps / 50 fps = 2 should be enough to align both cues (although sometimes we need to zero-pad, as discussed above). The reason AV-HuBERT stacks 4 frames is that videos are usually recorded at 25 fps, as is the case for LRS3.
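In other words, the stacking factor is just the ratio between the audio feature rate and the video frame rate. A tiny sketch of that arithmetic, assuming the usual 25 ms / 10 ms filterbank setup:

```python
# audio features at ~100 fps (25 ms windows, 10 ms hop) vs. video at video_fps
audio_feature_fps = 100
video_fps = 25                                            # 25 for LRS3-style data; 50 fps video would give 2
stack_order_audio = round(audio_feature_fps / video_fps)  # -> 4
```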
Best regards, David.
Thank you so much @david-gimeno, it works. The dataset I am currently working with also has a 25 fps frame rate.
Regarding the other question:
How do I specify a particular layer for feature extraction in

feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)

What should be passed to output_layer to get features from a specific layer? I tried passing arbitrary numbers and it accepted all of them. I just want to confirm what we should pass to output_layer
so that I extract embeddings from a specific layer.
I am glad it worked :) However, I've never tried to extract features from a specific intermediate layer of AV-HuBERT, so in this case I cannot help you. I guess you should open another issue or inspect the forward method of the architecture and then modify or find out how to return what you are looking for.
Best regards, David.
Thank you so much, I will open a new issue.
Thank you so much @david-gimeno, it works.
My pleasure, let me know if you need anything else.
@david-gimeno Just a quick question
What do you mean by "audio is typically extracted at 100 fps"?
In the field of ASR, the raw audio waveform is processed to extract the well-established Mel Frequency Cepstral Coefficients (MFCCs). I recommend reading more about how these audio features are computed. In general terms, and regarding the sample rate, we process the entire waveform with an overlapping sliding window. One of the most prevalent setups uses windows of 25 ms with a step of 10 ms, resulting in features extracted at approximately 100 fps. This processing method is pretty much a standard in the field. However, as expected, new methods and new feature extraction techniques have been explored. This paper explains the conventional HMM-based systems, if you want to learn more about the origins of ASR.
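As a quick sanity check of the ~100 fps figure, here is a small sketch using python_speech_features (as in the snippets above); it assumes 16 kHz audio and uses dummy noise just to show the output shape:

```python
import numpy as np
from python_speech_features import logfbank

sr = 16000
wav_data = np.random.randn(sr).astype(np.float32)  # 1 second of dummy 16 kHz audio

# 25 ms windows with a 10 ms step -> roughly 100 feature frames per second
audio_feats = logfbank(wav_data, samplerate=sr, winlen=0.025, winstep=0.01, nfilt=26)
print(audio_feats.shape)  # approximately (100, 26)
```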
Is the visual feature extracted by AV-HuBERT a generic video feature? Can I use AV-HuBERT pre-trained on English to extract features from Chinese videos? @chevalierNoir