PygmalionAI / aphrodite-engine

PygmalionAI's large-scale inference engine
https://pygmalion.chat
GNU Affero General Public License v3.0

[Usage]: Set memory usage for each GPU separately #532

Closed: Abdulhanan535 closed this issue 1 week ago

Abdulhanan535 commented 1 month ago

Your current environment

i7 (13th Gen), 30 GB DDR4 RAM, 2× T4 (2 × 15 GB)

How would you like to use Aphrodite?

I want to run the Llama 3 8B model (pretrained), and I don't want to quantize it. It takes 16 GB of VRAM with 8192 context and every optimization enabled. I have two T4 GPUs. If I run on a single T4 (15 GB) it crashes, but if I run on both, each GPU uses 14 GB. However, I want to keep about 7 GB of VRAM free on the second GPU for Stable Diffusion. I've tried the GPU VRAM percent parameter, but it limits VRAM on both GPUs and the model crashes with a CUDA out-of-memory (OOM) error.
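
A minimal sketch of why the percentage parameter fails here, assuming Aphrodite mirrors the Python API of vLLM (which it is forked from); the model ID and values are illustrative, not taken from the issue:

```python
from aphrodite import LLM  # assumed import, mirroring vLLM's `from vllm import LLM`

# gpu_memory_utilization is a single fraction applied to EVERY GPU in the
# tensor-parallel group; there is no per-GPU override.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B",  # illustrative model ID
    tensor_parallel_size=2,              # shard across both T4s
    gpu_memory_utilization=0.5,          # ~7.5 GB per 15 GB card, on BOTH cards
    max_model_len=8192,
)
# Capping the fraction low enough to spare ~7 GB on GPU 1 also caps GPU 0,
# so the 16 GB model no longer fits and loading fails with CUDA OOM.
```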

sgsdxzy commented 1 month ago

Due to the nature of tensor parallelism, this is not possible: the memory fraction applies uniformly to every GPU in the group. You may consider using exllamav2/tabbyAPI for this case; they can run unquantized models in addition to EXL2.
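
For reference, a hedged sketch of the suggested route using exllamav2's Python loading API (the model path and split values are illustrative assumptions): `gpu_split` takes a per-device budget in GB, so the second GPU can be capped to leave room for Stable Diffusion.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/llama-3-8b"  # hypothetical local model directory
config.prepare()
config.max_seq_len = 8192

model = ExLlamaV2(config)
# Unlike a global memory fraction, gpu_split is specified per device (in GB):
# put ~14 GB on GPU 0 and only ~8 GB on GPU 1, keeping ~7 GB free there.
model.load(gpu_split=[14.0, 8.0])

cache = ExLlamaV2Cache(model)  # KV cache is allocated on the loaded devices
```

tabbyAPI exposes the same knob in its config file (a per-GPU `gpu_split` list, if I recall correctly), so the uneven split can be set without writing any Python.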