Closed: chuxing closed this issue 11 months ago.
Could you please provide more information about your environment? We've tested the Hugging Face version of Emu2-Gen on an A800-80G GPU. In bfloat16 precision, it occupies 77GB of GPU memory and can run all the examples in README.md successfully.
EmuVisualGenerationPipeline inherits from diffusers.DiffusionPipeline, so you can use any multi-GPU technique supported by diffusers or accelerate.
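For example, CPU offload through accelerate; a minimal sketch, assuming the custom pipeline registers its components and offload order like a standard DiffusionPipeline (untested here), with `path` pointing at the local Emu2-Gen weights:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    path,                                 # local path to the Emu2-Gen weights
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
)

# Instead of pipe.to("cuda"): keep weights in CPU RAM and move each
# submodule to the GPU only for its forward pass. Requires `accelerate`,
# and assumes the pipeline defines a model offload sequence.
pipe.enable_model_cpu_offload()

# Lowest-memory (and slowest) variant, offloading leaf modules one by one:
# pipe.enable_sequential_cpu_offload()
```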
My environment:

```
cuda:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

torch: Version: 2.1.2+cu118

nvidia:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
```
The model loads successfully; after from_pretrained(emu2_gen) finishes, in bfloat16 precision it occupies 78GB of GPU memory.
Then I run the generation code:

1) Text-to-image: `pipe("impressionist painting of an astronaut in a jungle")` works fine.

2) Image editing:

```python
from PIL import Image

image = Image.open("tmp_file.png").convert('RGB').resize((256, 256))
prompt = [image, "wearing a red hat on the beach."]
ret = pipe(prompt)
```

This OOMs:
```
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB. GPU 0 has a total capacty of 79.10 GiB of which 505.94 MiB is free. Process 2861798 has 78.60 GiB memory in use. Of the allocated memory 75.77 GiB is allocated by PyTorch, and 1.11 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
The tail of the traceback:

```
File /opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.local/lib/python3.10/site-packages/diffusers/models/resnet.py:198, in Upsample2D.forward(self, hidden_states, output_size, scale)
    196 # If the input is bfloat16, we cast back to bfloat16
    197 if dtype == torch.bfloat16:
--> 198     hidden_states = hidden_states.to(dtype)
    200 # TODO(Suraj, Patrick) - clean up after weight dicts are correctly renamed
    201 if self.use_conv:
```
Update:

```bash
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32
source ~/.bashrc
```

Capping the maximum size of a single GPU memory allocation solves the problem; the pipeline now runs normally.
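The same allocator setting can also be applied from inside the script; a minimal sketch, with the only requirement being that the variable is set before torch initializes CUDA (setting it before the torch import is safest):

```python
import os

# Cap the allocator's split size to reduce fragmentation. This must run
# before the CUDA caching allocator starts, so set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

import torch  # noqa: E402
```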
Great! I'll close this issue. Feel free to reopen it if you encounter any other problems.
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    path,
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
    low_cpu_mem_usage=True,
)
pipe.to("cuda")
print(pipe)

prompt = "impressionist painting of an astronaut in a jungle"
ret = pipe(prompt)

# image as in the editing example above
prompt = [image, "wearing a red hat on the head."]
ret = pipe(prompt)
```

OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB

A single A100 80G runs out of memory. Can the model be loaded on multiple GPUs, or with CPU offload?
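For the multi-GPU half of the question (CPU offload is sketched earlier in the thread), newer diffusers releases (0.28+) accept a pipeline-level device_map; a sketch under that assumption, unverified against this custom pipeline and likely newer than the diffusers version used here:

```python
import torch
from diffusers import DiffusionPipeline

# Shard the pipeline's submodules across all visible GPUs instead of
# placing everything on cuda:0. Pipeline-level device_map support is an
# assumption about the installed diffusers version (0.28+).
pipe = DiffusionPipeline.from_pretrained(
    path,                                 # same local Emu2-Gen path as above
    custom_pipeline="pipeline_emu2_gen",
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    variant="bf16",
    device_map="balanced",
)

# Do not call pipe.to("cuda") afterwards; placement is handled by the map.
```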