OK. I will work on decreasing the CPU footprint of the SAT version.
You can try this:
First, install the low_cpu branch of SAT and CogVLM.
git clone https://github.com/THUDM/SwissArmyTransformer -b low_cpu
git clone https://github.com/THUDM/CogVLM -b low_cpu
cd SwissArmyTransformer
pip install . --no-deps
Then, run the cli_demo with low_cpu:
python cli_demo.py --version chat --from_pretrained cogvlm-chat-v1.1 --fp16 --english --quant 4 --low_cpu_memory
It should work, but I don't have a computer with limited memory to test on... If it works, I will merge it into the main branch. If not, feel free to tell me what problems you are facing.
Ideally, the low_cpu_memory option will take about 35 GB of memory.
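For reference, the idea behind the low_cpu_memory path is roughly the following (a minimal sketch of the general pattern, not the actual SAT code; MyModel and checkpoint.pt are placeholders):

import torch

# Sketch of low-CPU-memory loading: build the model on the "meta" device so no
# real weight storage is allocated, then attach the checkpoint tensors directly
# instead of first holding a full extra copy of the model in CPU RAM.
# Requires PyTorch >= 2.1 for mmap=True and assign=True.
with torch.device("meta"):
    model = MyModel()  # parameters are meta tensors, ~0 RAM

state_dict = torch.load("checkpoint.pt", map_location="cpu", mmap=True)  # memory-mapped, no full copy
model.load_state_dict(state_dict, assign=True)  # reuse the loaded tensors in place of the meta ones
model = model.half().cuda()  # materialize on the GPU in fp16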
Thanks for the quick reply. It does indeed allow me to run inference now (~43 GB of RAM used); however, I ran into a few issues in the process.
- --bf16 causes an error:
expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != c10::Half
so I have to resort to using --fp16 instead.
- The VRAM usage is 24/24 GB with the following command. I expected it to use the same as or less than the Hugging Face version. Is this related to using --fp16 instead of --bf16?
python3 cli_demo.py \
  --from_pretrained "./local path/cogvlm-chat-v1.1" \
  --version "chat" \
  --english \
  --fp16 \
  --quant 4 \
  --low_cpu_memory
- It takes about 60-80 seconds per image at quant 4. The Hugging Face pipeline can actually fit quant 8 on a 3090 with a 30-50 second iteration time. However, I believe others have been running this at around 3-8 seconds per iteration on similar hardware, so I'm not sure what's different.
Hope this helps!
Once this runs on GGML (llama.cpp), it should perform at 3-5 seconds per image on a 3090 (or smaller) with almost no RAM requirements. The problem is that the usual Python frameworks are horribly inefficient at anything but full-precision use; it all feels more like a scientific project than something aimed at real-world use. Sadly, at this point the only image models that run there are LLaVA-based (simple projection tensors between language and vision).
You can try adding --stream_chat to decrease response latency. You can also try quant 8 in the SAT version with --quant 8.
Ah, I see that --quant 8 seems to fit in 24 GB too. It must be that your code is just reserving all 24 GB, while the Hugging Face pipeline only reserves what is needed. I've got no problem with that.
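One way to check that, using PyTorch's standard CUDA memory statistics (you can print these anywhere after the model is on the GPU):

import torch

# "Reserved" is what the caching allocator has claimed from the driver (roughly
# what nvidia-smi reports); "allocated" is what live tensors actually occupy.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.1f} GiB")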
--stream_chat won't help me since I'm only interested in reducing total iteration time for dataset tagging purposes.
All that is left is figuring out why --bf16 ends up mismatched with torch.half somewhere after the model finishes loading and inference starts.
Quantization only supports fp16 CUDA kernels for now; bf16 is not supported for quantization.
Alright then. My original purpose was to see if the original implementation was any faster, so I'll stick to the 4-bit Hugging Face pipeline for now. You may want to add a warning when using quant + bf16 as well.
Thanks for getting this working on 64 GB of CPU RAM.
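A minimal sketch of the kind of guard being suggested, assuming cli_demo.py exposes the flags as args.quant and args.bf16 (the attribute names are an assumption):

import argparse

def check_quant_dtype(args: argparse.Namespace) -> None:
    # Fail fast when quantization is combined with bf16, since the quant kernels
    # are fp16-only (see the note above). The attribute names mirror the
    # --quant / --bf16 flags but are assumptions about cli_demo.py's argparse setup.
    if getattr(args, "quant", None) and getattr(args, "bf16", False):
        raise ValueError("--quant currently supports only fp16 kernels; use --fp16 instead of --bf16")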
OK. For normal 4-bit quantization, although the parameters are stored as 4-bit in memory, when they actually enter computation they are converted to fp16 or fp32, so it will definitely be slower than using fp16 directly.
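To illustrate the point, a toy sketch of what a "normal" weight-only 4-bit linear does at compute time (purely illustrative, not SAT's actual kernel; the unpacked uint8 storage and per-channel fp16 scale are simplifications):

import torch

def quantized_linear(x_fp16: torch.Tensor, q_weight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # q_weight: uint8 tensor of shape [out, in] holding unpacked 4-bit values (0..15);
    # scale: fp16 per-output-channel scale of shape [out, 1].
    # The weights sit in memory as 4-bit integers, but before the matmul they are
    # dequantized back to fp16, so the compute is an ordinary fp16 GEMM plus an
    # extra dequantization step every time the layer runs.
    w_fp16 = (q_weight.to(torch.float16) - 8) * scale  # dequantize with zero point 8
    return x_fp16 @ w_fp16.t()                         # plain fp16 matmul

It also shows where the bf16 error above comes from: the dequantized weights come out in fp16, so bf16 activations no longer match.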
You should at least take a look at the GGML library and the llava-cli / clip.cpp implementation there.
llama.cpp (GGML) can work with quantized tensors natively: you can multiply tensors at anything from 2-bit quantization up to 32-bit, and there is a kernel for each of those without losing speed. In many cases computation is actually faster that way, which means you only need the RAM or VRAM required for the quantized model itself, and it runs as fast as or faster than before.
I know that, but it's not "normal" quantization. We will work on the llama.cpp adaptation soon.
Since the low_cpu branch works, I'm going to close this issue.
Not sure if this is something I'm doing wrong - running on a 3090.
(cogvlm) nawal@rita:~/storage/chatgpt/CogVLM$ python cli_demo.py --from_pretrained cogvlm-chat-v1.1 --quant 4 --low_cpu_memory
[2024-01-07 20:26:37,122] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-07 20:26:39,233] [INFO] building CogVLMModel model ...
[2024-01-07 20:26:39,236] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-01-07 20:26:39,237] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=1. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
[2024-01-07 20:26:39,237] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
Traceback (most recent call last):
File "/home/nawal/storage/chatgpt/CogVLM/cli_demo.py", line 164, in <module>
main()
File "/home/nawal/storage/chatgpt/CogVLM/cli_demo.py", line 38, in main
model, model_args = CogVLMModel.from_pretrained(
File "/home/nawal/storage/miniforge3/envs/cogvlm/lib/python3.9/site-packages/sat/model/base_model.py", line 215, in from_pretrained
return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
File "/home/nawal/storage/miniforge3/envs/cogvlm/lib/python3.9/site-packages/sat/model/base_model.py", line 207, in from_pretrained_base
model = get_model(args, cls, **kwargs)
File "/home/nawal/storage/miniforge3/envs/cogvlm/lib/python3.9/site-packages/sat/model/base_model.py", line 406, in get_model
model = model_cls(args, params_dtype=params_dtype, **kwargs)
File "/home/nawal/storage/chatgpt/CogVLM/models/cogvlm_model.py", line 104, in __init__
self.add_mixin("eva", ImageMixin(args))
File "/home/nawal/storage/chatgpt/CogVLM/models/cogvlm_model.py", line 77, in __init__
self.vit_model = EVA2CLIPModel(EVA2CLIPModel.get_args(**vars(vit_args)))
File "/home/nawal/storage/chatgpt/CogVLM/models/eva_clip_model.py", line 112, in __init__
self.add_mixin("patch_embedding", ImagePatchEmbeddingMixin(args.in_channels, args.hidden_size, property, device=args.device))
TypeError: __init__() got an unexpected keyword argument 'device'
You should pip install the low_cpu version of sat.
You can try this:
First, install the low_cpu branch of SAT and CogVLM.
git clone https://github.com/THUDM/SwissArmyTransformer -b low_cpu
git clone https://github.com/THUDM/CogVLM -b low_cpu
cd SwissArmyTransformer
pip install . --no-deps
Then, run the cli_demo with low_cpu:
python cli_demo.py --version chat --from_pretrained cogvlm-chat-v1.1 --fp16 --english --quant 4 --low_cpu_memory
It should work, but I don't have a computer with limited memory to test on... If it works, I will merge it into the main branch. If not, feel free to tell me what problems you are facing.
Ideally, the low_cpu_memory option will take about 35 GB of memory.
Ouch, I have 32GB, will give it a go still.
Giving the new 4-bit quantized option a try, I noticed that 64 GB (52 free) of CPU RAM is not enough to load this model. It works fine with the Hugging Face pipeline thanks to low_cpu_mem_usage=True. The problem with the Hugging Face pipeline is that it's extremely slow to run inference, like 30-40 s per image on a 3090, so I wanted to give this a try to see if it's any faster. What options do I have to reduce CPU memory usage when loading the quantized model?
Thanks!
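For reference, the Hugging Face setup being compared against looks roughly like this (a sketch; generation code is omitted and the exact arguments may differ from what I actually ran):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Rough sketch of the Hugging Face pipeline mentioned above: 4-bit quantization via
# bitsandbytes plus low_cpu_mem_usage=True, which streams checkpoint shards instead
# of materializing a full extra copy of the model in CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval()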