phuchm opened this issue 8 months ago
In my experience, a GPU with 40GB of VRAM is needed to run it: loading the 4 external CV models plus MoAI takes about 20GB, and then forward propagation needs another 10GB or so.
But as I understand it, your code already uses 4-bit loading. Does that mean we would need at least 100GB of VRAM without 4-bit loading?
In my experience, roughly 30GB to 40GB of memory is occupied without any compression.
For 4-bit inference, roughly 20GB to 30GB is occupied.
I think the memory reduction (from torch.cuda.empty_cache()) takes effect between loading all the models and right before propagation.
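For reference, a minimal sketch of 4-bit loading with an empty_cache() call between model loading and propagation. This assumes a standard Hugging Face from_pretrained loader and a hypothetical model id; the repo's own loading path may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes, cutting weight memory roughly 4x vs fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model_id = "BK-Lee/MoAI-7B"  # assumed id for illustration; check the repo for the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# release cached blocks left over from loading before running propagation
torch.cuda.empty_cache()

inputs = tokenizer("Describe the image.", return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```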
@ByungKwanLee thank you so much for your explanation!
How do I define the device_map for inference on multiple GPUs? Thanks.
You should add extra modules with PyTorch DDP or Accelerate from Hugging Face!
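For the multi-GPU question, a common (repo-agnostic) approach is to let Accelerate shard the model across all visible GPUs via device_map="auto", optionally capping what each device may hold with max_memory. A sketch assuming a standard from_pretrained-style loader and a hypothetical model id:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "BK-Lee/MoAI-7B"  # hypothetical id for illustration

# "auto" lets Accelerate split the layers across every visible GPU;
# max_memory limits how much may be placed on each device, with
# the remainder offloaded to CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "15GiB", 1: "15GiB", "cpu": "32GiB"},
)
print(model.hf_device_map)  # shows which layers landed on which device
```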
I used two 16GB NVIDIA GeForce RTX 4080 GPUs to try to run demo.py but got the error message below:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 15.70 GiB of which 2.62 MiB is free. Including non-PyTorch memory, this process has 15.69 GiB memory in use. Of the allocated memory 15.42 GiB is allocated by PyTorch, and 4.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
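A hedged sketch of the usual mitigations on 16GB cards: set the allocator option the error message suggests before the first CUDA allocation, use 4-bit weights (which the thread above estimates at roughly 20-30GB total), and let device_map spread the load across both GPUs with a per-GPU cap. The model id is an assumption for illustration:

```python
import os

# Must be set before the first CUDA allocation (ideally before importing torch)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "BK-Lee/MoAI-7B"  # hypothetical; use the repo's actual checkpoint

# 4-bit weights plus sharding across both 16GB GPUs; cap each GPU below its
# physical limit to leave headroom for activations during propagation.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
    max_memory={0: "14GiB", 1: "14GiB"},
)
```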