Open gracikk-ds opened 4 months ago
Hi! The internvl_c_13b_224px checkpoint can be found here. As for the previous question, I will let the other co-authors, who are responsible for the grounding tasks, answer.
@gracikk-ds Hello. We did use InternVL text encoder with 7B parameters for grounding tasks.
@cg1177,
Thank you for your response,
Could I know the metrics you obtained with this encoder? In the preprint, you have provided the metrics for Stage2.
Feature | R1@0.5 | R1@0.7 | mAP | mAP | HiT@1 |
---|---|---|---|---|---|
InternVideo2_s2-1B | 70.00 | 54.45 | 47.02 | 42.36 | 69.74 |
Hi, thank you for the wonderful work! So the performance in the two subtables of Table 13 in the paper actually comes from a model finetuned from InternVideo2_clip? But why are the features listed as InternVideo2_s2-6B and InternVideo2_s2-1B in the table? Thank you for your guidance!
By the way, could you please provide more detail about how to use CG-DETR as the grounding head to do the moment retrieval task?
@tiesanguaixia Hello, we have released the extracted features here. You can download them and replace the original features used by CG-DETR with them. You may need to modify some of the feature-loading code for training and inference. We will release the code soon.
@cg1177, could you please provide direct link to chinese_alpaca_lora_7b? :)
Am I correct in understanding that to reproduce your results, I need to follow these steps:
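1. Download the checkpoints, namely:
   - InternVideo2-stage2_1b-224p-f4.pt
   - 1B_clip.pth
   - chinese_alpaca_lora_7b ???
   - internvl_c_13b_224px.pth
2. Initialize the InternVideo2_CLIP class, to which a config containing paths to the checkpoints mentioned above is passed. Additionally, load the 1B_clip.pth.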
Alternatively, can I use the same video model as in the demo, and load the 1B_clip.pth weights into it? And just change the tokenizer and textual model to LLaMa?
@tiesanguaixia @Andy1621
I've tried training a CGDETR-based model on the stage2_clip features that you released and on stage2 features that I extracted myself.
The difference is huge. Have you experimented with stage2 features? Or did you make some changes in CGDETR to make it perform better on stage2_clip features?
@Andy1621 @cg1177
@gracikk-ds Could you explain the plot?
These are the validation curves of the 'HIT@1' metric for CGDETR-like models, computed on the validation set. You reported the same metric in your paper, but for the test set.
There are 3 curves:
The difference remains on other metrics as well, for example, these are the results of MR mAP.
Training is not yet complete, but it is already evident that the results on stage_2 features are much better than the results on stage_2_clip.
My model is a bit more powerful than CGDETR, but I want you to focus on the gap between stage2 and stage2_clip. @cg1177
Thanks a lot! Could you please share the code you use to extract the multi-modal features? I'd like to use the models to extract features from my own data❤️
I've tried training a CGDETR-based model on the stage2_clip features that you released and on stage2 features that I extracted myself. The difference is huge. Have you experimented with stage2 features? Or did you make some changes in CGDETR to make it perform better on stage2_clip features?
@Andy1621 @cg1177
I have not experimented with this yet.
@gracikk-ds I believe this is reasonable. When I began training the grounding tasks, the stage_2 model was still being trained, so the initialization weights of stage_2_clip did not come from the best video encoder. Moreover, the 7B text encoder was frozen while training stage_2_clip. Both factors make the stage_2_clip model suboptimal, though still relatively strong. In contrast, the stage_2 model used more video-text data to train the BERT text encoder and the video encoder. I see that you have tried using features extracted by the stage_2 model for grounding tasks. Could you share your features? We can report the grounding performance of our CG-DETR with your features.
Stage2 features. @cg1177, please check these features. I'll wait for the results :)
Hi! :)
Is it possible for you to release a small demo showing how to run the BEATs model? I want to extract audio features too. Alternatively, could you give me a link to the audio checkpoint that you used when training the stage2 model? Or maybe some useful tips beyond this passage: "The used audio encoder is a 12-layer transformer initialized with BEATs [Chen et al., 2023d] (90M). It takes in audio features, which are 64-dimensional log Mel filterbank spectrograms using a 25ms Hamming window, transformed from 10-second-long clips, padding with zeros."
It will help me a lot) Thank you!
@cg1177, @Andy1621
Hello, we've checked these features and here are the results: We used the command bash cg_detr/scripts/train.sh. We simply downloaded the features you provided and replaced the original features used by CG-DETR with them. Do you need more details? @gracikk-ds
@LarryLeeee, No thanks, I got what I wanted :)
The last question I'm wondering is whether you are using an audio modality to train the stage2 or not. And which checkpoint should I take to extract audio features?
@gracikk-ds We did not use an audio modality, and you can refer to https://github.com/wjun0830/CGDETR for more details.
@LarryLeeee, I meant InternVideo2 stage2.
We exploit the correspondence between video and audio, speech, and text to align InternVideo2 to semantics explicitly. In structure, though InternVideo2 has a huge video encoder, its employed audio and text encoders are relatively lightweight. The used audio encoder is a 12-layer transformer initialized with BEATs [Chen et al., 2023d] (90M). It takes in audio features, which are 64-dimensional log Mel filterbank spectrograms using a 25ms Hamming window, transformed from 10-second-long clips, padding with zeros. For the text and speech encoders, we initialize the text encoder and multimodal decoder using Bert-Large [Devlin et al., 2018]. Specifically, we utilize the initial 19 layers of Bert-Large as the text encoder, with the subsequent 5 layers equipped with cross-attention layers serving as the multimodal decoder.
Could you provide link to audio model checkpoint?
@Andy1621, @cg1177, @LarryLeeee, hi! Any comments about the audio?
Hi, I would like to invite another co-author, who is responsible for the audio, to answer these questions. It will take some time to reach him.
@cg1177, we are limited in time, the conference submission deadline is approaching. Do you have a rough idea of how long it will take to communicate with co-author? We need to pick the audio model this week.
OK, I will urge him at once.
Hello @gracikk-ds, sorry for the late reply! We only used the audio encoder to train the InternVideo2-6B model; the InternVideo2-1B model contains only video and text encoders. Since the 6B checkpoint is not yet ready to be open-sourced, we can only provide the weights of the audio encoder of InternVideo2-6B, and we wonder whether that is acceptable. If it helps, we will provide the audio encoder checkpoint before tomorrow.
Hi @JustinYuu, thanks for your reply!
It is better than nothing. We look forward to checkpoints :) And if you provide simple demo of how to run the model, that would be perfect! In the case of the QVHighlights dataset, we have videos that are 2 seconds long, should we pad them to 10 seconds?
And are there any chances that the 6b model will be ready for open source by the end of this month?
Thank you!
Hi @gracikk-ds, we have provided the audio encoder of InternVideo2-6B at the following link. You can use this model to extract audio features for your project. Regarding audio length, during training we pad audio sequences shorter than 10 seconds to 10 seconds, but the audio used for training is usually longer than 2 seconds, so I am not sure whether this padding strategy suits your data. I suggest trying both options, padding to 10 seconds and feeding the raw 2-second sequence directly into the model, to find out which works better for your downstream scenario. For demo code, you can simply refer to the BEATs model, since our audio encoder is highly similar to it. A simple example is as follows:
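import torch
from BEATs import BEATs, BEATsConfig

# Audio encoder weights provided in this thread
checkpoint = torch.load('yourpath/audio_6b.pth')
# Original BEATs checkpoint, used here only for its model config
raw_checkpoint = torch.load('yourpath/BEATs_iter3+.pt')
cfg = BEATsConfig(raw_checkpoint['cfg'])
audio_model = BEATs(cfg)
audio_model.load_state_dict(checkpoint)
audio_model.eval()
audio_model = audio_model.cuda()
# fbank: preprocessed filterbank features, already moved to the GPU
representation = audio_model(fbank)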
For the 6b models, we have not decided on the open-source date yet. We will inform you once our model is publicly available. Hoping our model could help your research! :)
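The two options suggested above for short clips (zero-pad to 10 seconds, or feed the raw 2-second waveform) differ only in a padding step. Below is a minimal, framework-free sketch of that step. The helper name `pad_waveform`, the 16 kHz rate, and the convention that `True` in the mask marks padding are assumptions for illustration, not part of the released code.

```python
from __future__ import annotations

SAMPLE_RATE = 16_000   # Hz; assumed, matches the rate used elsewhere in this thread
TARGET_SECONDS = 10    # clip length used during training, per the reply above


def pad_waveform(
    samples: list[float],
    target_len: int = SAMPLE_RATE * TARGET_SECONDS,
) -> tuple[list[float], list[bool]]:
    """Right-pad a mono waveform with zeros (hypothetical helper).

    Inputs that are already long enough are returned unchanged.
    The mask is True at padded positions (assumed convention).
    """
    n_pad = max(0, target_len - len(samples))
    padded = list(samples) + [0.0] * n_pad
    mask = [False] * len(samples) + [True] * n_pad
    return padded, mask


# A 2-second clip, like the short QVHighlights segments discussed above.
clip = [0.1] * (2 * SAMPLE_RATE)
padded, mask = pad_waveform(clip)
print(len(clip), len(padded), sum(mask))  # option A feeds `padded`, option B feeds `clip`
```

With padding, four fifths of the input is zeros, which is one reason it may be worth comparing both options on the downstream task as suggested.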
Thanks a lot guys! :)
@JustinYuu, one more question :)
Here is my way to prepare features for your audio model. Is it correct?
import torch
import torchaudio
from torch import Tensor

EPS = 1e-6  # small constant to avoid log(0); placeholder value

def prepare_audio_features(audio_tensor: Tensor, sample_rate: int = 16000) -> Tensor:
    """
    Prepare audio features by normalizing the input audio tensor and applying a Log Mel spectrogram.
    Args:
        audio_tensor (Tensor): The input tensor containing the raw audio waveform.
        sample_rate (int): The sampling rate of the audio tensor. Defaults to 16000 Hz.
    Returns:
        Tensor: A tensor representing the log Mel spectrogram of the input audio.
    """
    # Define the MelSpectrogram transform
    # it's not evident which values to use for 'win_length', 'n_fft', 'hop_length', 'n_mels' and 'window_fn'
    # In your paper:
    # win_length=400 - Equivalent to 25ms window size at 16kHz
    # n_fft=??? It could be the next power of two from window length
    # n_fft = 2 ** math.ceil(math.log2(window_length_samples)) = 512
    # n_mels = 64,
    # window_fn = hamming_window
    # hop_length=???. I'm using 200 as 0.5 overlap
    # But the BEATs article uses different parameter values.
    mel_spectrogram = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        win_length=400,
        n_fft=512,
        hop_length=200,
        n_mels=64,  # Number of Mel bands
        window_fn=torch.hamming_window,
    )
    # Apply the transform to get the Mel spectrogram
    mel_spectrogram = mel_spectrogram(audio_tensor)
    # Convert to log scale
    log_mel_spectrogram = torch.log(mel_spectrogram + EPS)  # Add a small value to avoid log(0)
    # Based on the BEATs paper the acoustic feature is normalized to the mean value of 0 and standard deviation of 0.5
    # But which values should I use for mean and std???
    log_mel_spectrogram = (log_mel_spectrogram - log_mel_spectrogram.mean()) / (log_mel_spectrogram.std() * 2)
    return log_mel_spectrogram

waveform, sample_rate = torchaudio.load("your_audio_file.wav")
# Apply effects to get the desired sample rate and number of channels
waveform, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
    waveform,
    sample_rate,
    effects=[["rate", "16000"], ["channels", "1"]],
)
fbank = prepare_audio_features(waveform, sample_rate)
And here is the original BEATs preprocessing step:
def preprocess(
    self,
    source: torch.Tensor,
    fbank_mean: float = 15.41663,
    fbank_std: float = 6.55582,
) -> torch.Tensor:
    fbanks = []
    for waveform in source:
        waveform = waveform.unsqueeze(0) * 2**15
        fbank = ta_kaldi.fbank(waveform, num_mel_bins=128, sample_frequency=16000, frame_length=25, frame_shift=10)
        fbanks.append(fbank)
    fbank = torch.stack(fbanks, dim=0)
    fbank = (fbank - fbank_mean) / (2 * fbank_std)
    return fbank
I also have a question about the model's forward pass. Here is the forward pass of the BEATs model, annotated with my output shapes.
def forward(
    self,
    source: torch.Tensor,
    padding_mask: Optional[torch.Tensor] = None,
    fbank_mean: float = 15.41663,
    fbank_std: float = 6.55582,
):
    """Forward pass for the BEATs model.
    Args:
        source (torch.Tensor): Input tensor.
        padding_mask (Optional[torch.Tensor]): Padding mask tensor. Defaults to None.
        fbank_mean (float): Mean value for feature normalization. Defaults to 15.41663.
        fbank_std (float): Standard deviation for feature normalization. Defaults to 6.55582.
    Returns:
        torch.Tensor: Model output tensor.
    """
    # source.shape = [32, 32000] 16k per second
    # preparing audio features with the original BEATs preprocessing gives me output shape: [32, 198, 128]
    fbank = prepare_audio_features_old(source, fbank_mean=fbank_mean, fbank_std=fbank_std)
    # And I get output shape [32, 64, 161] using my function
    my_fbank = prepare_audio_features(source)
    if padding_mask is not None:
        padding_mask = self.forward_padding_mask(fbank, padding_mask)
    fbank = fbank.unsqueeze(1)  # [32, 198, 128] -> [32, 1, 198, 128]
    features = self.patch_embedding(fbank)  # [32, 1, 198, 128] -> [32, 512, 12, 8]
    features = features.reshape(features.shape[0], features.shape[1], -1)  # [32, 512, 12, 8] -> [32, 512, 96]
    features = features.transpose(1, 2)  # [32, 512, 96] -> [32, 96, 512]
    features = self.layer_norm(features)
    if padding_mask is not None:
        padding_mask = self.forward_padding_mask(features, padding_mask)
    features = self.post_extract_proj(features)  # [32, 96, 512] -> [32, 96, 768]
    x = self.dropout_input(features)
    x, _ = self.encoder(x, padding_mask=padding_mask)  # [32, 96, 768] -> [32, 96, 768]
    return x, padding_mask
And here is my question: what should I do next with the embedding of shape [32, 96, 768] to get [32, 768]?
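The shapes quoted in this comment can be checked with simple frame-count arithmetic. The sketch below is an illustration only: it assumes the Kaldi-style filterbank drops incomplete edge frames (torchaudio's `snip_edges=True` default), that `MelSpectrogram` uses its default `center=True`, and that the patch embedding is a non-overlapping 16x16 convolution, which is inferred from the `[12, 8]` grid above rather than confirmed by the authors. The function names are hypothetical.

```python
def kaldi_fbank_frames(num_samples: int, sample_rate: int = 16_000,
                       frame_length_ms: float = 25.0,
                       frame_shift_ms: float = 10.0) -> int:
    """Frame count for a Kaldi-style fbank with no edge padding (assumed)."""
    win = int(sample_rate * frame_length_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * frame_shift_ms / 1000)    # 160 samples at 16 kHz
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


def centered_stft_frames(num_samples: int, hop_length: int) -> int:
    """Frame count for an STFT that pads the signal so frames are centered."""
    return 1 + num_samples // hop_length


def patch_tokens(frames: int, mel_bins: int, patch: int = 16) -> int:
    """Token count after a non-overlapping patch x patch convolution (assumed)."""
    return (frames // patch) * (mel_bins // patch)


n = 32_000  # 2 seconds at 16 kHz, as in the comment above
frames = kaldi_fbank_frames(n)
print(frames)                        # BEATs-style preprocessing: time frames
print(centered_stft_frames(n, 200))  # custom MelSpectrogram with hop_length=200
print(patch_tokens(frames, 128))     # tokens entering the transformer encoder
```

Under these assumptions the three values are 198, 161 and 96, matching `[32, 198, 128]`, `[32, 64, 161]` and `[32, 96, 768]`. The custom pipeline also produces 64 mel bins in a channels-first layout, so its output is not a drop-in replacement for the BEATs filterbank.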
@cg1177, could you summon @JustinYuu one more time? :DD
OK
@takfate Hi, sorry for the late response! For the preprocessing, you can directly use the preprocess function in BEATs:
import librosa
import torch
audio_input_16khz, _ = librosa.load(audio_path, sr=16000)  # mono float32 NumPy array
# pad here if you need to
waveform = torch.from_numpy(audio_input_16khz).unsqueeze(0)  # preprocess iterates over a batch of tensors
fbank = audio_model.preprocess(waveform).cuda()
For the dimension of the BEATs output, we use an average pooling layer to squeeze the temporal dimension to 1:
representation = audio_model(fbank)
pooled_audio_embeds = representation.mean(dim=1)
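If a clip was zero-padded before encoding, a plain mean over time also averages the tokens that correspond to padding. The sketch below shows a padding-aware average in plain Python so the arithmetic is easy to follow. It is not the authors' method: the function name `masked_mean_pool` and the convention that `True` marks padding are assumptions, and in practice this would be a tensor operation on the `[batch, tokens, dim]` output.

```python
from __future__ import annotations


def masked_mean_pool(tokens: list[list[float]], is_padding: list[bool]) -> list[float]:
    """Average token embeddings over time, skipping positions flagged as padding."""
    kept = [tok for tok, pad in zip(tokens, is_padding) if not pad]
    if not kept:
        raise ValueError("all positions are padding")
    dim = len(kept[0])
    return [sum(tok[d] for tok in kept) / len(kept) for d in range(dim)]


# Three tokens of dimension 2; the last one corresponds to padded audio.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(masked_mean_pool(tokens, [False, False, True]))    # [2.0, 3.0]
print(masked_mean_pool(tokens, [False, False, False]))   # plain mean over all tokens
```

Whether masking helps depends on how the encoder was trained; the reply above describes plain average pooling, so that remains the reference behaviour.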
@cg1177, could you please provide direct link to chinese_alpaca_lora_7b? :)
Am I correct in understanding that to reproduce your results, I need to follow these steps:
1. Download the checkpoints, namely:
   - InternVideo2-stage2_1b-224p-f4.pt
   - 1B_clip.pth
   - chinese_alpaca_lora_7b ???
   - internvl_c_13b_224px.pth
2. Initialize the InternVideo2_CLIP class, to which a config containing paths to the checkpoints mentioned above is passed. Additionally, load the 1B_clip.pth.
Alternatively, can I use the same video model as in the demo, and load the 1B_clip.pth weights into it? And just change the tokenizer and textual model to LLaMa?
@tiesanguaixia @Andy1621
hey @gracikk-ds have you figured out this already?
Excuse me, did you get the relevant code? I encountered a similar problem, extracting multimodal features for cg-detr.
@nickyzhi Did you get any update? I was trying to find config file or guidance on how to load the InternVideo1B stage 2 clip model similar to how it is done in the demo for InternVideo1Bstage2. Thanks in advance.
@cg1177, could you please provide direct link to chinese_alpaca_lora_7b? :)
Am I correct in understanding that to reproduce your results, I need to follow these steps:
1. Download the checkpoints, namely:
   * InternVideo2-stage2_1b-224p-f4.pt
   * 1B_clip.pth
   * chinese_alpaca_lora_7b ???
   * internvl_c_13b_224px.pth
2. Initialize the [InternVideo2_CLIP class](https://github.com/OpenGVLab/InternVideo/blob/049860e1d7e5bbaddcb6064578906b88424d80c2/InternVideo2/multi_modality/models/internvideo2_clip.py#L16), to which a [config](https://github.com/OpenGVLab/InternVideo/blob/main/InternVideo2/multi_modality/scripts/evaluation/clip/zero_shot/1B/config_charades_mc.py) containing paths to the checkpoints mentioned above is passed. Additionally, load the 1B_clip.pth.
Alternatively, can I use the same video model as in the demo, and load the 1B_clip.pth weights into it? And just change the tokenizer and textual model to LLaMa?
@tiesanguaixia @Andy1621
Hey, can you share the config file you used to load the clip model or if you have a code that helps to load it similar to demo_notebook? I am currently confused with the chinese_alpaca path and with the intern_vl model regarding where it is loaded? Any help is highly appreciated.
Hello InternVideo team,
You guys have done a great job with this project!
In your paper, you use the Stage 2 model for the task of temporal grounding on QVHighlights [Lei et al., 2021] and Charades-STA [Gao et al., 2017]. I have a question: why not use the CLIP version for this purpose?
As you mentioned in one of the issues I saw, the CLIP one is fine-tuned from Stage 2 to support more applications (with the powerful InternVL text encoder).
Am I correct in understanding that you kept the video encoder model unchanged, and the BERT-L was replaced with another text encoder? If so, where can I obtain the weights for this encoder?
In the evaluation script, you use "your_model_path/internvl/internvl_c_13b_224px.pth", there is no such model in the InternVL repository.
@Andy1621