PKU-YuanGroup / Video-LLaVA

【EMNLP 2024🔥】Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
https://arxiv.org/pdf/2311.10122.pdf
Apache License 2.0

Distributed Inference Doesn't Work #102

Open dfan opened 7 months ago

dfan commented 7 months ago

I followed the instructions in the README to install the conda environment and run the video inference sample code. I get the following error:

```
line 289, in prepare_inputs_labels_for_multimodal
    cur_new_input_embeds = torch.cat(cur_new_input_embeds)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:6! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```

The code runs successfully when I limit the visible cuda devices to a single GPU, e.g. CUDA_VISIBLE_DEVICES=0.
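
For context, the failure reduces to the standard PyTorch rule that `torch.cat` requires all inputs to live on one device. A minimal standalone sketch (not Video-LLaVA code, requires at least two visible GPUs) that reproduces and resolves the same error class:

```python
import torch

# Tensors placed on two different GPUs.
a = torch.randn(4, 8, device="cuda:0")
b = torch.randn(4, 8, device="cuda:1")

try:
    torch.cat([a, b])  # raises: Expected all tensors to be on the same device ...
except RuntimeError as e:
    print(e)

# Moving every tensor to a single device before concatenating resolves it.
merged = torch.cat([a, b.to(a.device)])
print(merged.device)  # cuda:0
```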

Lr-2002 commented 7 months ago

Hello @dfan, what GPU are you using? I'm facing the same problem as you. Have you solved it?

Lr-2002 commented 7 months ago

OK guys, I fixed the same problem by passing `--device "cuda:0"`. It might help you too.

LinB203 commented 7 months ago

`CUDA_VISIBLE_DEVICES=0 python your_script.py`

Lr-2002 commented 7 months ago

Thx ~

shouborno commented 5 months ago

@dfan , were you able to run it with multiple GPUs? I also need distributed inference.

shouborno commented 5 months ago

@dfan, I've fixed it in https://github.com/PKU-YuanGroup/Video-LLaVA/pull/145. We no longer need to restrict inference to a single device (e.g., cuda:0). With this PR, we can distribute inference across as many GPUs as we want (e.g., cuda:0,1 for GPUs 0 and 1, or cuda for all available GPUs).
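
For anyone who just wants the general pattern outside this repo, here is a rough sketch of sharding a Hugging Face model across the visible GPUs with `device_map="auto"`. This is not the PR's code; the model id is a placeholder, and Video-LLaVA ships its own loader and processors.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint id for illustration only.
model_id = "your-org/your-llava-style-model"

# device_map="auto" lets accelerate spread the layers over every visible GPU.
# Limit visibility with e.g. `CUDA_VISIBLE_DEVICES=0,1 python infer.py` to get
# the "cuda:0,1" behaviour described above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```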

@LinB203 , please check this.

dandre0102 commented 5 months ago

@shouborno, I'm having the same issue. Just to make sure I understand your fix: doesn't it just write everything onto one GPU again? How can I use all available GPUs?

shouborno commented 5 months ago

> @shouborno, I'm having the same issue. Just to make sure I understand your fix: doesn't it just write everything onto one GPU again? How can I use all available GPUs?

It should allow using all your GPUs. For example, if you run out of allocatable VRAM on a single GPU, that shouldn't happen with this LLaVA distributed inference fix.

dandre0102 commented 5 months ago

> > @shouborno, I'm having the same issue. Just to make sure I understand your fix: doesn't it just write everything onto one GPU again? How can I use all available GPUs?
>
> It should allow using all your GPUs. For example, if you run out of allocatable VRAM on a single GPU, that shouldn't happen with this LLaVA distributed inference fix.

After implementing your proposed fix:

```python
[...]
        cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))

        cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
        cur_new_input_embeds = torch.cat(cur_new_input_embeds)
        cur_new_labels = torch.cat(cur_new_labels)
```

I get the following out-of-memory error instead:

```
File ~/anaconda3/envs/languagebind/lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

File ~/anaconda3/envs/languagebind/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(*args, **kwargs)
    163     output = old_forward(*args, **kwargs)
    164 else:
--> 165     output = old_forward(*args, **kwargs)
    166 return module._hf_hook.post_forward(module, output)

File ~/anaconda3/envs/languagebind/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py:346, in LlamaAttention.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache)
    343     attn_weights = attn_weights + attention_mask
    345 # upcast attention to fp32
--> 346 attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
    347 attn_output = torch.matmul(attn_weights, value_states)
    349 if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.41 GiB. GPU 0 has a total capacity of 14.58 GiB of which 940.50 MiB is free.
```

Running it without your fix throws the 'expected' error:

```
line 289, in prepare_inputs_labels_for_multimodal
    cur_new_input_embeds = torch.cat(cur_new_input_embeds)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:6! (when checking argument for argument tensors in method wrapper_CUDA_cat)
```
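
While debugging this, it can help to print the free/total memory of every visible GPU to see where allocation is actually failing. This is plain PyTorch, unrelated to the PR:

```python
import torch

# Report free vs. total memory for each visible CUDA device.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**3:.2f} GiB free of {total / 1024**3:.2f} GiB")
```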

shouborno commented 5 months ago

@dandre0102, it might depend on your data and the number of tokens. For example, for my use case, with 48 GB of total GPU memory across two GPUs (cuda:0,1), I need 4-bit quantization. With cuda:0 or cuda:1 on their own (24 GB), it runs out of memory even with 4-bit quantization.
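
For reference, here is a minimal sketch of 4-bit loading with transformers + bitsandbytes combined with multi-GPU sharding. It assumes a generic Hugging Face checkpoint (placeholder id below); the exact Video-LLaVA loader options may differ.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Placeholder checkpoint id; substitute the weights you actually use.
model_id = "your-org/your-llava-style-model"

# NF4 4-bit quantization shrinks the weights enough to shard them across
# two 24 GB GPUs when a single GPU would run out of memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across all visible GPUs (e.g. cuda:0,1)
)
```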