OK. I will work on decreasing the CPU footprint of the SAT version.
You can try this:
First, install the low_cpu branch of SAT and CogVLM.
git clone https://github.com/THUDM/SwissArmyTransformer -b low_cpu
git clone https://github.com/THUDM/CogVLM -b low_cpu
cd SwissArmyTransformer
pip install . --no-deps
Then, run the cli_demo with low_cpu:
python cli_demo.py --version chat --from_pretrained cogvlm-chat-v1.1 --fp16 --english --quant 4 --low_cpu_memory
It should work, but I don't have a computer with limited memory to test on... If it works, I will merge it into the main branch. If not, feel free to tell me what problems you are facing.
Ideally, the low_cpu_memory option will take about 35 GB of memory.
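For reference, the idea behind the low_cpu_memory path is roughly the following (a minimal sketch of the general pattern, not the actual SAT code; MyModel and checkpoint.pt are placeholders):

import torch

# Sketch of low-CPU-memory loading: build the model on the "meta" device so no
# real weight storage is allocated, then attach the checkpoint tensors directly
# instead of first holding a full extra copy of the model in CPU RAM.
# Requires PyTorch >= 2.1 for mmap=True and assign=True.
with torch.device("meta"):
    model = MyModel()  # parameters are meta tensors, ~0 RAM

state_dict = torch.load("checkpoint.pt", map_location="cpu", mmap=True)  # memory-mapped, no full copy
model.load_state_dict(state_dict, assign=True)  # reuse the loaded tensors in place of the meta ones
model = model.half().cuda()  # materialize on the GPU in fp16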
Thanks for the quick reply. It does indeed allow me to run inference now (~43 GB of RAM used); however, I ran into a few issues in the process.
- --bf16 causes an error:
expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Half
so I have to resort to using --fp16 instead.
- The VRAM usage is 24/24 GB with the following command. I expected it to use the same as or less than the Hugging Face version. Is this related to using --fp16 instead of --bf16?
python3 cli_demo.py \
  --from_pretrained "./local path/cogvlm-chat-v1.1" \
  --version "chat" \
  --english \
  --fp16 \
  --quant 4 \
  --low_cpu_memory
- It takes about 60-80 seconds per image at quant 4. The Hugging Face pipeline can actually fit quant 8 on a 3090 with a 30-50 second iteration time. However, I believe others have been running this at around 3-8 seconds per iteration on similar hardware, so I'm not sure what's different.
Hope this helps!
Once this runs on GGML (llama.cpp), it should perform at 3-5 seconds per image on a 3090 (or smaller) with almost no RAM requirements. The problem is that the usual Python frameworks are horribly inefficient at anything but full-precision use; it all feels more like a scientific project than something aimed at real-world use. Sadly, at this point the only image models that run there are LLaVA-based (simple projection tensors between language and vision).
You can try adding --stream_chat to decrease response latency. You can also try quant 8 in the SAT version with --quant 8.
Ah, I see that --quant 8 seems to fit in 24 GB too. It must be that your code is just reserving all 24 GB, while the Hugging Face pipeline only reserves what is needed. I've got no problem with that.
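One way to check that, using PyTorch's standard CUDA memory statistics (you can print these anywhere after the model is on the GPU):

import torch

# "Reserved" is what the caching allocator has claimed from the driver (roughly
# what nvidia-smi reports); "allocated" is what live tensors actually occupy.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.1f} GiB")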
--stream_chat won't help me since I'm only interested in reducing total iteration time for dataset tagging purposes.
All that is left is figuring out why --bf16 ends up mismatched with torch.half somewhere after the model finishes loading and inference starts.
Quantization only supports fp16 CUDA kernels for now; bf16 is not supported for quantization.
Alright then. My original purpose was to see if the original implementation was any faster, so I'll stick to the 4-bit Hugging Face pipeline for now. You may want to add a warning when using quant + bf16 as well.
Thanks for getting this working on 64 GB of CPU RAM.
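A minimal sketch of the kind of guard being suggested, assuming cli_demo.py exposes the flags as args.quant and args.bf16 (the attribute names are an assumption):

import argparse

def check_quant_dtype(args: argparse.Namespace) -> None:
    # Fail fast when quantization is combined with bf16, since the quant kernels
    # are fp16-only (see the note above). The attribute names mirror the
    # --quant / --bf16 flags but are assumptions about cli_demo.py's argparse setup.
    if getattr(args, "quant", None) and getattr(args, "bf16", False):
        raise ValueError("--quant currently supports only fp16 kernels; use --fp16 instead of --bf16")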
OK. For normal 4-bit quantization, although the parameters are stored as 4-bit in memory, when they actually enter computation they are converted to fp16 or fp32, so it will definitely be slower than using fp16 directly.
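To illustrate the point, a toy sketch of what a "normal" weight-only 4-bit linear does at compute time (purely illustrative, not SAT's actual kernel; the unpacked uint8 storage and per-channel fp16 scale are simplifications):

import torch

def quantized_linear(x_fp16: torch.Tensor, q_weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # q_weight: uint8 tensor of shape [out, in] holding unpacked 4-bit values (0..15);
    # scale: fp16 per-output-channel scale of shape [out, 1].
    # The weights sit in memory as 4-bit integers, but before the matmul they are
    # dequantized back to fp16, so the compute is an ordinary fp16 GEMM plus an
    # extra dequantization step every time the layer runs.
    w_fp16 = (q_weight.to(torch.float16) - 8) * scale  # dequantize with zero point 8
    return x_fp16 @ w_fp16.t()                         # plain fp16 matmul

It also shows where the bf16 error above comes from: the dequantized weights come out in fp16, so bf16 activations no longer match.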
You should at least take a look at the GGML library and the llava-cli / clip.cpp implementation there.
llama.cpp (GGML) can work with quantized tensors natively: you can multiply tensors at anything from 2-bit quantization up to 32-bit, and there is a kernel for each of those without losing speed. In many cases computation is actually faster that way, which means you only need the RAM or VRAM required for the quantized model itself, and it runs as fast as or faster than before.
I know that, but it's not "normal" quantization. We will work on the llama.cpp adaptation soon.
Since the low_cpu branch works, I'm going to close this issue.
Not sure if this is something I'm doing wrong - running on a 3090.
(cogvlm) nawal@rita:~/storage/chatgpt/CogVLM$ python cli_demo.py --from_pretrained cogvlm-chat-v1.1 --quant 4 --low_cpu_memory
[2024-01-07 20:26:37,122] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-07 20:26:39,233] [INFO] building CogVLMModel model ...
[2024-01-07 20:26:39,236] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-01-07 20:26:39,237] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=1. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
[2024-01-07 20:26:39,237] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
Traceback (most recent call last):
File "/home/nawal/storage/chatgpt/CogVLM/cli_demo.py", line 164, in <module>
main()
File "/home/nawal/storage/chatgpt/CogVLM/cli_demo.py", line 38, in main
model, model_args = CogVLMModel.from_pretrained(
File "/home/nawal/storage/miniforge3/envs/cogvlm/lib/python3.9/site-packages/sat/model/base_model.py", line 215, in from_pretrained
return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
File "/home/nawal/storage/miniforge3/envs/cogvlm/lib/python3.9/site-packages/sat/model/base_model.py", line 207, in from_pretrained_base
model = get_model(args, cls, **kwargs)
File "/home/nawal/storage/miniforge3/envs/cogvlm/lib/python3.9/site-packages/sat/model/base_model.py", line 406, in get_model
model = model_cls(args, params_dtype=params_dtype, **kwargs)
File "/home/nawal/storage/chatgpt/CogVLM/models/cogvlm_model.py", line 104, in __init__
self.add_mixin("eva", ImageMixin(args))
File "/home/nawal/storage/chatgpt/CogVLM/models/cogvlm_model.py", line 77, in __init__
self.vit_model = EVA2CLIPModel(EVA2CLIPModel.get_args(**vars(vit_args)))
File "/home/nawal/storage/chatgpt/CogVLM/models/eva_clip_model.py", line 112, in __init__
self.add_mixin("patch_embedding", ImagePatchEmbeddingMixin(args.in_channels, args.hidden_size, property, device=args.device))
TypeError: __init__() got an unexpected keyword argument 'device'
You should pip install the low_cpu version of sat.
You can try this:
First, install the low_cpu branch of SAT and CogVLM.
git clone https://github.com/THUDM/SwissArmyTransformer -b low_cpu
git clone https://github.com/THUDM/CogVLM -b low_cpu
cd SwissArmyTransformer
pip install . --no-deps
Then, run the cli_demo with low_cpu:
python cli_demo.py --version chat --from_pretrained cogvlm-chat-v1.1 --fp16 --english --quant 4 --low_cpu_memory
It should work, but I don't have a computer with limited memory to test on... If it works, I will merge it into the main branch. If not, feel free to tell me what problems you are facing.
Ideally, the low_cpu_memory option will take about 35 GB of memory.
Ouch, I have 32GB, will give it a go still.
Giving the new 4-bit quantized option a try, I noticed that 64 GB (52 free) of CPU RAM is not enough to load this model. It works fine with the Hugging Face pipeline thanks to low_cpu_mem_usage=True. The problem with the Hugging Face pipeline is that it's extremely slow to run inference, like 30-40 s per image on a 3090, so I wanted to give this a try to see if it's any faster. What options do I have to reduce CPU memory usage when loading the quantized model?
Thanks!
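For reference, the Hugging Face setup being compared against looks roughly like this (a sketch; generation code is omitted and the exact arguments may differ from what I actually ran):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Rough sketch of the Hugging Face pipeline mentioned above: 4-bit quantization via
# bitsandbytes plus low_cpu_mem_usage=True, which streams checkpoint shards instead
# of materializing a full extra copy of the model in CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()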