[Help requested] Inference InternVideo2_clip model.

gracikk-ds commented 4 months ago

Hello InternVideo team,

You guys have done a great job with this project!

In your paper, you use the Stage 2 model for the task of temporal grounding on QVHighlight [Lei et al., 2021] and Charades-STA [Gao et al., 2017]. Why not use the CLIP version for this purpose?

As you mentioned in one of the issues I saw, the CLIP one is fine-tuned from Stage 2 to support more applications (with the powerful InternVL text encoder).

Am I correct in understanding that you kept the video encoder model unchanged, and the BERT-L was replaced with another text encoder? If so, where can I obtain the weights for this encoder?

In the evaluation script, you use "your_model_path/internvl/internvl_c_13b_224px.pth", but there is no such model in the InternVL repository.

@Andy1621

Andy1621 commented 4 months ago

Hi! The internvl_c_13b_224px can be found here. As for the previous question, I will let the other co-authors who are responsible for the grounding tasks answer.

cg1177 commented 4 months ago

@gracikk-ds Hello. We did use InternVL text encoder with 7B parameters for grounding tasks.

gracikk-ds commented 4 months ago

@cg1177,

Thank you for your response,

Could I know the metrics you obtained with this encoder? In the preprint, you have provided the metrics for Stage2.

Feature	R1@0.5	R1@0.7	mAP (MR)	mAP (HD)	HiT@1
InternVideo2_s2-1B	70.00	54.45	47.02	42.36	69.74

tiesanguaixia commented 4 months ago

> @gracikk-ds Hello. We did use InternVL text encoder with 7B parameters for grounding tasks.

Hi, thank you for the wonderful work! So the results in the two subtables of Table 13 in the paper are actually fine-tuned from InternVideo2_clip? But then why are the features listed in the table as InternVideo2_s2-6B and InternVideo2_s2-1B? Thank you for your guidance!

tiesanguaixia commented 4 months ago

By the way, could you please provide more detail about how to use CG-DETR as the grounding head to do the moment retrieval task?

cg1177 commented 4 months ago

@tiesanguaixia Hello, we have released the extracted features here. You can download them and use them in place of the original features used by CG-DETR. You may need to modify some of the code that loads features for training and inference. We will release the code soon.

gracikk-ds commented 4 months ago

@cg1177, could you please provide a direct link to chinese_alpaca_lora_7b? :)

Am I correct in understanding that to reproduce your results, I need to follow these steps:

1. Download the checkpoints, namely:
   - InternVideo2-stage2_1b-224p-f4.pt
   - 1B_clip.pth
   - chinese_alpaca_lora_7b ???
   - internvl_c_13b_224px.pth
2. Initialize the InternVideo2_CLIP class with a config containing the paths to the checkpoints mentioned above. Additionally, load 1B_clip.pth.

Alternatively, can I use the same video model as in the demo and load the 1B_clip.pth weights into it, changing only the tokenizer and the text model to LLaMA?

@tiesanguaixia @Andy1621

gracikk-ds commented 4 months ago

> @tiesanguaixia Hello, we have released the extracted features at here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some codes about loading features for training and inference. We will release the code soon.

I've tried training a CG-DETR-based model on the stage2_clip features that you released and on stage2 features that I extracted myself.

The difference is huge. Have you experimented with stage2 features? Or did you make some changes to CG-DETR so that it performs better on stage2_clip features?

@Andy1621 @cg1177

cg1177 commented 4 months ago

@gracikk-ds Could you explain the plot?

gracikk-ds commented 4 months ago

These are the validation curves of the 'HIT@1' metric for CG-DETR-like models, computed on the validation set. You reported the same metric in your paper, but for the test set.

There are 3 curves:

curve 42_new_feat_neg_pos - model trained on stage_2 features, random state=42
curve 41_test_llama_features - model trained on stage_2_clip features provided by you, random state=41
curve 42_test_llama_features - model trained on stage_2_clip features provided by you, random state=42.

The gap persists on other metrics as well; for example, these are the results for MR mAP.

Training is not yet complete, but it is already evident that the results on stage_2 features are much better than the results on stage_2_clip features.

My model is a bit more powerful than CG-DETR, but I want you to focus on the gap between stage2 and stage2_clip. @cg1177

tiesanguaixia commented 4 months ago

> @tiesanguaixia Hello, we have released the extracted features at here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some codes about loading features for training and inference. We will release the code soon.

Thanks a lot! Could you please share the code you use to extract the multi-modal features? I'd like to use the models to extract features from my own data❤️

tiesanguaixia commented 4 months ago

> > @tiesanguaixia Hello, we have released the extracted features at here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some codes about loading features for training and inference. We will release the code soon.
>
> I've tried to train CGDETR based model on stage2_clip features that you have released and on stage2 features extracted by myself. The difference is huge. Have you experimented with stage2 features? Or maybe do you made some changes in CGDETR to make in perform better on stage2_clip features?
>
> @Andy1621 @cg1177

I have not experimented with this yet.

cg1177 commented 4 months ago

@gracikk-ds I believe this is reasonable. When I began training for the grounding tasks, the stage_2 model was still being trained, so the initialization weights of stage_2_clip did not include the best video encoder. Moreover, the 7B text encoder was frozen while training stage_2_clip. Both factors make the stage_2_clip model suboptimal, though still relatively strong. In contrast, the stage_2 model used more video-text data to train both the BERT text encoder and the video encoder. I see that you have tried features extracted by the stage_2 model for the grounding tasks. Could you share your features? We can then report the grounding performance of our CG-DETR with your features.

gracikk-ds commented 4 months ago

Stage2 features. @cg1177, please check these features. I'll wait for the results :)

gracikk-ds commented 4 months ago

Hi! :)

Is it possible for you to release a small demo on how to run the BEATs model? I want to extract audio features too. Or maybe you could give me a link to the audio checkpoint that you used when training the stage2 model? Or maybe some useful tips beyond this passage from the paper: "The used audio encoder is a 12-layer transformer initialized with BEATs [Chen et al., 2023d] (90M). It takes in audio features, which are 64-dimensional log Mel filterbank spectrograms using a 25ms Hamming window, transformed from 10-second-long clips, padding with zeros"?

It will help me a lot) Thank you!

@cg1177, @Andy1621

LarryLeeee commented 4 months ago

> Stage2 features @cg1177, try to check this features. I'll wait for results :)

Hello, we've checked these features, and here are the results: [image: 1717468802961]. We used the command bash cg_detr/scripts/train.sh. We simply downloaded the features you provided and used them in place of the original features used by CG-DETR. Do you need more details? @gracikk-ds

gracikk-ds commented 4 months ago

@LarryLeeee, No thanks, I got what I wanted :)

The last question I'm wondering is whether you are using an audio modality to train the stage2 or not. And which checkpoint should I take to extract audio features?

LarryLeeee commented 4 months ago

> @LarryLeeee, No thanks, I got what I wanted :)
>
> The last question I'm wondering is whether you are using an audio modality to train the stage2 or not. And which checkpoint should I take to extract audio features?

@gracikk-ds We did not use an audio modality, and you can refer to https://github.com/wjun0830/CGDETR for more details.

gracikk-ds commented 4 months ago

> > @LarryLeeee, No thanks, I got what I wanted :) The last question I'm wondering is whether you are using an audio modality to train the stage2 or not. And which checkpoint should I take to extract audio features?
>
> @gracikk-ds We did not use an audio modality, and you can refer to https://github.com/wjun0830/CGDETR for more details.

@LarryLeeee, I meant InternVideo2 stage2. From the paper:

> We exploit the correspondence between video and audio, speech, and text to align InternVideo2 to semantics explicitly. In structure, though InternVideo2 has a huge video encoder, its employed audio and text encoders are relatively lightweight. The used audio encoder is a 12-layer transformer initialized with BEATs [Chen et al., 2023d] (90M). It takes in audio features, which are 64-dimensional log Mel filterbank spectrograms using a 25ms Hamming window, transformed from 10-second-long clips, padding with zeros. For the text and speech encoders, we initialize the text encoder and multimodal decoder using Bert-Large [Devlin et al., 2018]. Specifically, we utilize the initial 19 layers of Bert-Large as the text encoder, with the subsequent 5 layers equipped with cross-attention layers serving as the multimodal decoder.

Could you provide a link to the audio model checkpoint?

gracikk-ds commented 4 months ago

@Andy1621, @cg1177, @LarryLeeee, hi! Any comments about the audio?

cg1177 commented 4 months ago

> @Andy1621, @cg1177, @LarryLeeee, hi! Any comments about the audio?

Hi, I would like to invite another co-author, who is responsible for the audio, to answer your questions. It will take some time to get in touch.

gracikk-ds commented 3 months ago

@cg1177, we are short on time: the conference submission deadline is approaching. Do you have a rough idea of how long it will take to reach the co-author? We need to pick the audio model this week.

cg1177 commented 3 months ago

> @cg1177, we are limited in time, the conference submission deadline is approaching. Do you have a rough idea of how long it will take to communicate with co-author? We need to pick the audio model this week.

OK, I will urge him right away.

JustinYuu commented 3 months ago

> @cg1177, we are limited in time, the conference submission deadline is approaching. Do you have a rough idea of how long it will take to communicate with co-author? We need to pick the audio model this week.

Hello @gracikk-ds, sorry for the late reply! We used the audio encoder only when training the InternVideo2-6B model; the InternVideo2-1B model contains only video and text encoders. Since the 6B checkpoint is not yet ready to be open-sourced, we can only provide the weights of the audio encoder from InternVideo2-6B, and we wonder whether that is acceptable. If it helps, we will provide the audio encoder checkpoint before tomorrow.

gracikk-ds commented 3 months ago

Hi @JustinYuu, thanks for your reply!

It is better than nothing. We look forward to the checkpoint :) And if you could provide a simple demo of how to run the model, that would be perfect! In the case of the QVHighlights dataset, we have clips that are 2 seconds long; should we pad them to 10 seconds?

And is there any chance that the 6B model will be open-sourced by the end of this month?

Thank you!

JustinYuu commented 3 months ago

> Hi @JustinYuu, thanks for your reply!
>
> It is better than nothing. We look forward to checkpoints :) And if you provide simple demo of how to run the model, that would be perfect! In the case of the QVHighlights dataset, we have videos that are 2 seconds long, should we pad them to 10 seconds?
>
> And are there any chances that the 6b model will be ready for open source by the end of this month?
>
> Thank you!

Hi @gracikk-ds, we have provided the audio encoder of InternVideo2-6B at the following link. You can use this model to extract audio features for your project. Regarding audio length, during training we pad audio sequences shorter than 10 seconds to 10 seconds. However, the audio used for training is usually longer than 2 seconds, so I am not sure whether this padding strategy suits your data. I suggest trying both options, padding to 10 seconds and feeding the unpadded 2-second sequence directly into the model, to find out which works better for your downstream scenario. For demo code, you can simply refer to the BEATs model, since our audio encoder is highly similar to it. A simple example is as follows:

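import torch
from BEATs import BEATs, BEATsConfig

# Audio encoder weights of InternVideo2-6B provided by the authors
checkpoint = torch.load('yourpath/audio_6b.pth')
# Original BEATs checkpoint, used here only for its model config
raw_checkpoint = torch.load('yourpath/BEATs_iter3+.pt')
cfg = BEATsConfig(raw_checkpoint['cfg'])
audio_model = BEATs(cfg)
audio_model.load_state_dict(checkpoint)
audio_model.eval()
audio_model = audio_model.cuda()
# fbank: filterbank features of the input audio, prepared beforehand
representation = audio_model(fbank)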

For the 6B models, we have not decided on an open-source date yet. We will inform you once the model is publicly available. We hope our model helps your research! :)
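For anyone comparing the two options mentioned above (zero-padding to 10 seconds versus feeding the short clip as-is), here is a minimal sketch of the padding step. It assumes 16 kHz mono waveforms of shape [batch, samples] and pads on the right; the padding side is an assumption, since the thread only says "padding with zeros". The helper name `pad_to_seconds` is hypothetical.

```python
import torch
import torch.nn.functional as F


def pad_to_seconds(waveform: torch.Tensor, sample_rate: int = 16000, seconds: float = 10.0) -> torch.Tensor:
    """Right-pad a [batch, samples] waveform with zeros up to `seconds`.

    Inputs that are already long enough are returned unchanged.
    Right-padding is an assumption, not confirmed by the authors.
    """
    target = int(sample_rate * seconds)
    missing = target - waveform.shape[-1]
    if missing <= 0:
        return waveform
    # F.pad takes (left, right) amounts for the last dimension
    return F.pad(waveform, (0, missing))


# A 2-second clip at 16 kHz, as in the QVHighlights case discussed above
clip = torch.randn(1, 32000)
padded = pad_to_seconds(clip)
```

Extracting features from both `clip` and `padded` and comparing downstream results would be one way to run the comparison suggested in the reply.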

gracikk-ds commented 3 months ago

Thanks a lot guys! :)

gracikk-ds commented 3 months ago

@JustinYuu, one more question :)

Here is my way to prepare features for your audio model. Is it correct?

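import torch
import torchaudio
from torch import Tensor

EPS = 1e-6  # small constant to avoid log(0); the exact value is my choice


def prepare_audio_features(audio_tensor: Tensor, sample_rate: int = 16000):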
    """
    Prepare audio features by normalizing the input audio tensor and applying a Log Mel spectrogram.

    Args:
        audio_tensor (Tensor): The input tensor containing the raw audio waveform.
        sample_rate (int): The sampling rate of the audio tensor. Defaults to 16000 Hz.

    Returns:
        Tensor: A tensor representing the log Mel spectrogram of the input audio.
    """
    # Define the MelSpectrogram transform
    # it's not evident which values to use for 'win_length', 'n_fft', 'hop_length', 'n_mels' and 'window_fn'
    # In your paper: 
    # win_length=400 - Equivalent to 25ms window size at 16kHz
    # n_fft=??? It could be the next power of two from window length
    # n_fft = 2 ** math.ceil(math.log2(window_length_samples)) = 512
    # n_mels = 64, 
    # window_fn = hamming_window
    # hop_length=???. I'm using 200 as 0.5 overlap
    # But the BEATs article uses different parameter values.

    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        win_length=400, 
        n_fft=512,
        hop_length=200,
        n_mels=64,  # Number of Mel bands
        window_fn=torch.hamming_window,
    )

    # Apply the transform to get the Mel spectrogram
    mel_spectrogram = mel_spectrogram(audio_tensor)

    # Convert to log scale
    log_mel_spectrogram = torch.log(mel_spectrogram + EPS)  # Add a small value to avoid log(0)

    # Based on the BEATs paper the acoustic feature is normalized to the mean value of 0 and standard deviation of 0.5
    # But which values should I use for mean and std???
    log_mel_spectrogram = (log_mel_spectrogram - log_mel_spectrogram.mean()) / ( log_mel_spectrogram.std() * 2)

    return log_mel_spectrogram

waveform, sample_rate = torchaudio.load("your_audio_file.wav")

# Apply effects to get the desired sample rate and number of channels
waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
    waveform,
    sample_rate,
    effects=[["rate", "16000"], ["channels", "1"]],
)

fbank = prepare_audio_features(waveform, sample_rate)

And here is the original BEATs preprocessing step:

    def preprocess(
        self,
        source: torch.Tensor,
        fbank_mean: float = 15.41663,
        fbank_std: float = 6.55582,
    ) -> torch.Tensor:
        fbanks = []
        for waveform in source:
            waveform = waveform.unsqueeze(0) * 2**15
            fbank = ta_kaldi.fbank(waveform, num_mel_bins=128, sample_frequency=16000, frame_length=25, frame_shift=10)
            fbanks.append(fbank)
        fbank = torch.stack(fbanks, dim=0)
        fbank = (fbank - fbank_mean) / (2 * fbank_std)
        return fbank

I also have a question about the model's forward pass. Here is the forward pass of the BEATs model, annotated with my output shapes.

    def forward(
        self,
        source: torch.Tensor,
        padding_mask: Optional[torch.Tensor] = None,
        fbank_mean: float = 15.41663,
        fbank_std: float = 6.55582,
    ):
        """Forward pass for the BEATs model.

        Args:
            source (torch.Tensor): Input tensor.
            padding_mask (Optional[torch.Tensor]): Padding mask tensor. Defaults to None.
            fbank_mean (float): Mean value for feature normalization. Defaults to 15.41663.
            fbank_std (float): Standard deviation for feature normalization. Defaults to 6.55582.

        Returns:
            torch.Tensor: Model output tensor.
        """
        # source.shape = [32, 32000], i.e. 2 seconds at 16 kHz
        # Preparing audio features with the original BEATs preprocessing gives output shape [32, 198, 128]
        fbank = prepare_audio_features_old(source, fbank_mean=fbank_mean, fbank_std=fbank_std)
        #  And I get output shape [32, 64, 161] using my function 
        my_fbank = prepare_audio_features(source)

        if padding_mask is not None:
            padding_mask = self.forward_padding_mask(fbank, padding_mask)

        fbank = fbank.unsqueeze(1)  # [32, 198, 128] -> [32, 1, 198, 128]
        features = self.patch_embedding(fbank)  # [32, 1, 198, 128] -> [32, 512, 12, 8]
        features = features.reshape(features.shape[0], features.shape[1], -1)  # [32, 512, 12, 8] -> [32, 512, 96]
        features = features.transpose(1, 2)  # [32, 512, 96]-> [32, 96, 512]
        features = self.layer_norm(features)

        if padding_mask is not None:
            padding_mask = self.forward_padding_mask(features, padding_mask)

        features = self.post_extract_proj(features)  # [32, 96, 512] -> [32, 96, 768]
        x = self.dropout_input(features)
        x, _ = self.encoder(x, padding_mask=padding_mask)  # [32, 96, 768] -> [32, 96, 768]
        return x, padding_mask

And here is the question: what should I do next with the embedding of shape [32, 96, 768] to get [32, 768]?
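A note on the shapes quoted above: the two frame counts (198 versus 161) follow directly from the framing parameters. The sketch below reproduces the arithmetic in plain Python. It assumes Kaldi-style framing without edge padding for the BEATs fbank, centered STFT framing for torchaudio's MelSpectrogram, and a 16x16 non-overlapping patch embedding; the patch size is inferred from the [32, 512, 12, 8] shape in the comments, and the function names are hypothetical.

```python
def kaldi_fbank_frames(num_samples: int, sample_rate: int = 16000,
                       frame_length_ms: float = 25.0, frame_shift_ms: float = 10.0) -> int:
    """Frame count for Kaldi-style fbank framing with no edge padding."""
    window = int(sample_rate * frame_length_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * frame_shift_ms / 1000)    # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def centered_stft_frames(num_samples: int, hop_length: int) -> int:
    """Frame count for an STFT that pads the signal so frames are centered."""
    return 1 + num_samples // hop_length


def patch_tokens(frames: int, mel_bins: int, patch: int = 16) -> int:
    """Token count after a non-overlapping patch x patch embedding (patch size assumed)."""
    return (frames // patch) * (mel_bins // patch)


beats_frames = kaldi_fbank_frames(32000)        # BEATs preprocess on a 2-second clip
my_frames = centered_stft_frames(32000, 200)    # MelSpectrogram with hop_length=200
tokens = patch_tokens(beats_frames, 128)        # sequence length fed to the encoder
```

Under these assumptions a 2-second clip yields 198 fbank frames and 96 tokens, and a 10-second clip yields proportionally more, which is worth keeping in mind when comparing padded and unpadded inputs.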

gracikk-ds commented 3 months ago

@cg1177, could you summon @JustinYuu one more time? :DD

takfate commented 3 months ago

> @cg1177, could you summon @JustinYuu one more time? :DD

OK

JustinYuu commented 3 months ago

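@takfate Hi, sorry for the late response! For the preprocessing, you can directly use the preprocess function in BEATs:

import librosa, torch
audio_input_16khz, _ = librosa.load(audio_path, sr=16000)  # BEATs expects 16 kHz mono audio
# pad here if you need to
# preprocess iterates over a batched tensor, so convert the NumPy array to shape [1, samples]
waveform = torch.from_numpy(audio_input_16khz).unsqueeze(0)
fbank = audio_model.preprocess(waveform).cuda()

For the dimension of the BEATs output, we use an average pooling layer to squeeze the temporal dimension to 1: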

representation = audio_model(fbank)
pooled_audio_embeds = representation.mean(dim=1)
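One caveat with a plain mean over dim=1: if clips are zero-padded and a padding mask is passed, the padded positions are averaged in as well. Below is a minimal sketch of a masked alternative. It is an assumption on my side, not something the authors confirmed using, and it follows the convention that True in the mask marks padded positions. The helper name `masked_mean_pool` is hypothetical.

```python
from typing import Optional

import torch


def masked_mean_pool(features: torch.Tensor, padding_mask: Optional[torch.Tensor] = None) -> torch.Tensor:
    """Average [batch, tokens, dim] features over tokens.

    Positions where padding_mask is True are skipped (assumed convention).
    Without a mask this reduces to features.mean(dim=1).
    """
    if padding_mask is None:
        return features.mean(dim=1)
    keep = (~padding_mask).unsqueeze(-1).to(features.dtype)  # [batch, tokens, 1]
    total = (features * keep).sum(dim=1)
    count = keep.sum(dim=1).clamp(min=1.0)  # avoid division by zero for fully padded rows
    return total / count


# Toy tensors with the shapes discussed in this thread
feats = torch.ones(2, 96, 768)
mask = torch.zeros(2, 96, dtype=torch.bool)
mask[1, 48:] = True  # second item: last half is padding
pooled = masked_mean_pool(feats, mask)
```

With no mask the result matches `representation.mean(dim=1)` from the reply above, so the two approaches only differ when padding is present.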

nickyzhi commented 3 months ago

> @cg1177, could you please provide direct link to chinese_alpaca_lora_7b? :)
>
> Am I correct in understanding that to reproduce your results, I need to follow these steps:
>
> Download the checkpoints, namely:
> - InternVideo2-stage2_1b-224p-f4.pt
> - 1B_clip.pth
> - chinese_alpaca_lora_7b ???
> - internvl_c_13b_224px.pth
>
> Initialize the InternVideo2_CLIP class, to which a config containing paths to the checkpoints mentioned above is passed. Additionally, load the 1B_clip.pth.
>
> Alternatively, can I use the same video model as in the demo, and load the 1B_clip.pth weights into it? And just change the tokenizer and textual model to LLaMa?
>
> @tiesanguaixia @Andy1621

Hey @gracikk-ds, have you figured this out already?

I downloaded the 7B text encoder from https://huggingface.co/hfl/chinese-alpaca-lora-7b/tree/main, but it doesn't seem to include a config file. Could you clarify whether this is the model you used for CLIP inference? @Andy1621 @tiesanguaixia @JustinYuu If so, could you share the config file for it?

I successfully loaded internvl_c_13b_224px and the stage2 1B checkpoint, but found nowhere to load the 1b_clip.pth file. Any instructions? The eval script doesn't seem to include this either: https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/scripts/evaluation/clip/zero_shot/1B/config_charades_mc.py
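While waiting for an official answer, a generic PyTorch way to check whether a checkpoint such as 1b_clip.pth lines up with an already constructed model is to load it non-strictly and inspect which keys did not match. The sketch below uses a toy module as a stand-in; the wrapper keys "module" and "model" are common guesses and are not confirmed for this checkpoint, and `load_and_report` is a hypothetical helper, not part of the InternVideo2 code.

```python
from typing import List, Tuple

import torch
from torch import nn


def load_and_report(model: nn.Module, state_dict: dict) -> Tuple[List[str], List[str]]:
    """Load weights with strict=False and return (missing_keys, unexpected_keys)."""
    # Some checkpoints nest the weights under a top-level key.
    # "module" and "model" are guesses, not confirmed for 1b_clip.pth.
    for wrapper in ("module", "model"):
        if wrapper in state_dict and isinstance(state_dict[wrapper], dict):
            state_dict = state_dict[wrapper]
            break
    result = model.load_state_dict(state_dict, strict=False)
    return list(result.missing_keys), list(result.unexpected_keys)


# Toy stand-in for the real model and checkpoint
toy = nn.Linear(4, 2)
ckpt = {"module": {"weight": torch.zeros(2, 4), "extra.bias": torch.zeros(2)}}
missing, unexpected = load_and_report(toy, ckpt)
```

A long list of missing or unexpected keys would suggest the checkpoint was saved for a different module layout or with a key prefix that needs stripping.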

XiaohuJoshua commented 2 months ago

> > @tiesanguaixia Hello, we have released the extracted features at here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some codes about loading features for training and inference. We will release the code soon.
>
> Thanks a lot! Could you please share a code about how you extract the multi-modal features? I'd like to use the models to extract feature of my own data❤️

Excuse me, did you ever get the relevant code? I've run into a similar problem: extracting multi-modal features for CG-DETR.

Divyanshupy commented 1 month ago

@nickyzhi Did you get any update? I was trying to find config file or guidance on how to load the InternVideo1B stage 2 clip model similar to how it is done in the demo for InternVideo1Bstage2. Thanks in advance.

Divyanshupy commented 1 month ago

> @cg1177, could you please provide direct link to chinese_alpaca_lora_7b? :)
>
> Am I correct in understanding that to reproduce your results, I need to follow these steps:
> 1. Download the checkpoints, namely:
>
>    * InternVideo2-stage2_1b-224p-f4.pt
>    * 1B_clip.pth
>    * chinese_alpaca_lora_7b ???
>    * internvl_c_13b_224px.pth
>
> 2. Initialize the [InternVideo2_CLIP class](https://github.com/OpenGVLab/InternVideo/blob/049860e1d7e5bbaddcb6064578906b88424d80c2/InternVideo2/multi_modality/models/internvideo2_clip.py#L16), to which a [config](https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/scripts/evaluation/clip/zero_shot/1B/config_charades_mc.py) containing paths to the checkpoints mentioned above is passed. Additionally, load the 1B_clip.pth.
>
> Alternatively, can I use the same video model as in the demo, and load the 1B_clip.pth weights into it? And just change the tokenizer and textual model to LLaMa?
>
> @tiesanguaixia @Andy1621

Hey, can you share the config file you used to load the CLIP model, or any code that loads it in a way similar to demo_notebook? I am currently confused about the chinese_alpaca path and about where the intern_vl model is loaded. Any help is highly appreciated.

OpenGVLab / InternVideo

[Help requested] Inference InternVideo2_clip model. #129