AndrewToomey opened 2 months ago
It appears that the model is loaded in bf16 instead of the fp32 it was created in. Usually this means the model is first loaded into system RAM and then cast down to the lower precision before being moved into VRAM. With a model this large, that requires a LOT of system RAM. From what I could find, t5_xxl needs a little over 20.5GB of VRAM when run in bf16 (down from about 44GB in fp32), and the system RAM is freed once the weights are on the GPU.
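For reference, a minimal sketch of loading just the T5-XXL encoder directly in bf16 with transformers, so the full fp32 weights never have to be materialized first. The checkpoint name is my assumption, not necessarily what this repo points at:

```python
# Hedged sketch: load only the T5 encoder, already cast to bf16.
# "google/t5-v1_1-xxl" is an assumed checkpoint name.
import torch
from transformers import T5EncoderModel

encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",
    torch_dtype=torch.bfloat16,   # ~22 GB of weights vs ~44 GB in fp32
    low_cpu_mem_usage=True,       # avoid holding a second full copy in system RAM
).to("cuda")
```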
This isn't my repo and I can't test it right now (my GPU is busy captioning music for training), but it may be possible to either swap the t5 model for xl or another smaller model, load xxl in int8 or int4, OR permanently save a bf16-quantized copy of xxl and use that.
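A rough sketch of that last option, i.e. convert once and load the smaller checkpoint from then on. Paths and the checkpoint name here are assumptions:

```python
# One-time conversion: cast the encoder to bf16 and write it back out so
# later runs never touch the fp32 weights. Names/paths are placeholders.
import torch
from transformers import T5EncoderModel

enc = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl", torch_dtype=torch.bfloat16)
enc.save_pretrained("t5_xxl_encoder_bf16")
# later: T5EncoderModel.from_pretrained("t5_xxl_encoder_bf16", torch_dtype=torch.bfloat16)
```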
Easiest thing to start with: how much system RAM and VRAM do you have? My guess is the hang-up is either that free system RAM is too low for the conversion, so it spills into swap (SLOW), or that there isn't enough VRAM and you're on newer NVIDIA drivers, which page parts of the model in and out of system RAM to avoid an out-of-memory error (ALSO SLOW).
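If it helps answer that, here's a quick way to print those numbers (uses psutil and torch; interpret the results for your own setup):

```python
# Report free system RAM and free/total VRAM.
import psutil
import torch

ram = psutil.virtual_memory()
print(f"System RAM: {ram.available / 2**30:.1f} GiB free of {ram.total / 2**30:.1f} GiB")

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM:       {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```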
I just tested it.
This is on the most current pull of the repo: inference should fit entirely within a 24GB VRAM card, even on Windows with the UI and everything loaded. @AndrewToomey Were you talking about training the model or just generating samples? @mejikomtv Could you post whether you were testing inference (generating samples) or training, and how much GPU VRAM you have?
OK. So on a 24GB card (3090) a single inference currently takes roughly 20 seconds. The easiest way to test is to copy the config\example.txt file, remove all but one line, then sample with this new file supplied as the prompt. It stays just under the VRAM limit for 24GB. Interestingly, it takes roughly 2 minutes to generate two samples if you leave two prompts in the new examples.txt file.
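Roughly what I mean by the test setup, as a sketch (file names taken from above; the actual sampling command depends on the repo's scripts, so I've left that out):

```python
# Copy config\example.txt, keeping only the first prompt line, then point the
# sampler at the new file. "one_prompt.txt" is just a placeholder name.
from pathlib import Path

src = Path("config") / "example.txt"
dst = Path("config") / "one_prompt.txt"
first_prompt = src.read_text().splitlines()[0]
dst.write_text(first_prompt + "\n")
```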
This is taking about 2 hours with the smallest model.
I presume the issue is that my GPU cannot load a t5_XXL model into memory. According to the Hugging Face page, the model weights are 44.5 GB.
Is there any possibility of switching it out for a GGUF of t5_XXL, or at least quantizing with bitsandbytes (just the encoder)?
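For the bitsandbytes route, something along these lines should work for just the encoder (needs bitsandbytes and accelerate installed; the checkpoint name is my guess, not necessarily what this repo loads):

```python
# Sketch of loading only the T5-XXL encoder in 8-bit via bitsandbytes.
import torch
from transformers import BitsAndBytesConfig, T5EncoderModel

encoder = T5EncoderModel.from_pretrained(
    "google/t5-v1_1-xxl",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",   # lets accelerate place the quantized weights on the GPU
)
```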