PKU-YuanGroup / LanguageBind

【ICLR 2024🔥】 Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
https://arxiv.org/abs/2310.01852
MIT License

Seeing excessive GPU memory usage during inference #3

Closed: abhimanyu891998 closed this issue 7 months ago

abhimanyu891998 commented 7 months ago

Hi, great work, and thanks for open-sourcing it. I was trying your model on 150 video clips and 150 audio clips, each 5 seconds long. Below is the code I am using; the arrays video_clips and audio_files each hold 150 items. During embedding generation the GPU uses more than 8 GB of memory and the run stops. I tried the exact same samples with ImageBind, and inference and embedding generation work fine there. Any idea whether I am doing something wrong?

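import torch
from languagebind import LanguageBind, LanguageBindVideoTokenizer, to_device, transform_dict

# Load the video and audio branches onto the GPU in inference mode.
device = torch.device('cuda:0')
clip_type = ('video', 'audio')
model = LanguageBind(clip_type=clip_type)
model = model.to(device)
model.eval()
pretrained_ckpt = 'lb203/LanguageBind_Video'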

tokenizer = LanguageBindVideoTokenizer.from_pretrained(pretrained_ckpt, cache_dir='./cache_dir/tokenizer_cache_dir')
modality_transform = {c: transform_dict[c](model.modality_config[c]) for c in clip_type}

inputs = {
    'video': to_device(modality_transform['video'](video_clips), device),
    'audio': to_device(modality_transform['audio'](audio_files), device),
}

inputs['language'] = to_device(tokenizer(transcriptions_list, max_length=77, padding='max_length',
                                         truncation=True, return_tensors='pt'), device)

with torch.no_grad():
    embeddings = model(inputs)

LinB203 commented 7 months ago

Sorry for the late reply. We have spent the last few days training a stronger audio model, and the update is now available.

It looks like you are using a batch size of 150, which is too large for 8 GB of GPU memory. Try feeding 2-4 samples at a time. If you want to compute the similarity matrix over all 150 samples, run them through the model in batches and concatenate the resulting features at the end.
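
For illustration, the batching idea above could look roughly like the sketch below. It is not code from the LanguageBind repository: the helper names chunked and embed_in_batches, and the embed_batch callback, are hypothetical and only show the pattern of running small chunks and collecting the per-sample features in order.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")  # one input sample (e.g. a file path)
E = TypeVar("E")  # one embedding row


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    items: Sequence[T],
    embed_batch: Callable[[Sequence[T]], Sequence[E]],
    batch_size: int = 4,
) -> List[E]:
    """Run `embed_batch` on small chunks and collect the rows in input order.

    `embed_batch` is a hypothetical callback: it receives one chunk of
    samples and returns one embedding per sample.
    """
    features: List[E] = []
    for batch in chunked(items, batch_size):
        # Only one chunk is processed at a time, so peak memory depends on
        # batch_size rather than on len(items).
        features.extend(embed_batch(batch))
    return features
```

In the setting of this issue, embed_batch would presumably wrap the transform, to_device, and model(inputs) calls from the snippet above inside torch.no_grad(), and move each chunk's output to the CPU before returning it (an assumption about how you would wire it up, not something the repository prescribes). The collected rows can then be joined with torch.stack or torch.cat to build the full 150-sample similarity matrix.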

abhimanyu891998 commented 7 months ago

Thank you! That helped. I will try out the new audio model too!