casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Support for llava-v1.5 and llava-v1.6 with transformers==4.40.0 #456

Open · WanBenLe opened this issue 6 months ago

WanBenLe commented 6 months ago

I tried to run llava-v1.6-34b-hf-awq and succeeded, but how can I run the test for Llava-v1.5 ConditionalGeneration? https://github.com/casper-hansen/AutoAWQ/pull/250 The example likely has two bugs:

  1. max_position_embeddings and max_seq_length.
  2. LlavaForConditionalGeneration.forward() in the new transformers v4.40.0 calls the language model with input_ids=None, so in /AutoAWQ-main/awq/modules/fused/model.py the call input_ids, self.last_forward_num_tokens = fused_utils.prepare_input_ids(input_ids, self.last_forward_num_tokens) and the subsequent input_ids.shape will raise an error (a minimal sketch follows this list). Does this mean I need to modify the contents of both the Llava and LlavaNext sections? Waiting for an answer, best wishes!
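
A minimal sketch of the failure in point 2 (illustrative only, not the actual AutoAWQ code path):

    # transformers>=4.40 builds inputs_embeds itself (splicing the image
    # features into the token embeddings) and calls the language model with
    # input_ids=None, so a fused forward that immediately reads
    # input_ids.shape crashes on the first prefill step.
    input_ids = None
    try:
        _bsz, seqlen = input_ids.shape
    except AttributeError as err:
        print(f"fused forward fails here: {err}")  # 'NoneType' object has no attribute 'shape'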
WanBenLe commented 6 months ago

I have fixed the llava-v1.5 issue for the latest transformers version and added support for llava-v1.6 (LlavaNext). Can I create a PR for these? The change is in /root/autodl-tmp/wsl/AutoAWQ-main/awq/modules/fused/model.py (and other files):

        if input_ids is None and kwargs['past_key_values'] is None:
            # transformers>=4.40 calls the language model with input_ids=None
            # and precomputed inputs_embeds on the prefill step, so take the
            # (batch, seq_len) shape from position_ids and use the embeddings
            # that Llava already built.
            input_ids, self.last_forward_num_tokens = fused_utils.prepare_input_ids(
                kwargs['position_ids'], self.last_forward_num_tokens
            )
            _bsz, seqlen = kwargs['position_ids'].shape
            h = kwargs['inputs_embeds']
            device = h.device
        else:
            # Standard path: token ids are present, so embed them ourselves.
            input_ids, self.last_forward_num_tokens = fused_utils.prepare_input_ids(
                input_ids, self.last_forward_num_tokens
            )
            _bsz, seqlen = input_ids.shape
            device = input_ids.device
            h = self.embedding(input_ids)
        fused_utils.prepare_cache(self.blocks, seqlen)
        mask = fused_utils.prepare_attention_mask(
            seqlen=seqlen,
            start_pos=self.blocks[0].attn.start_pos,
            device=device,
            type_as=h,
        )
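
For context: the fallback branch uses position_ids only to recover the (batch, seq_len) shape; the hidden states come straight from inputs_embeds, which Llava has already built by splicing the image features into the token embeddings, so no re-embedding is needed.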
casper-hansen commented 6 months ago

Hi @WanBenLe, please create a PR with a description of the issue and how this solves your problem.

zjysteven commented 5 months ago

Hi @WanBenLe, would you mind sharing the exact script for quantization to get llava-v1.6-34b-hf-awq? @casper-hansen I'm also wondering if the PR will be merged soon

WanBenLe commented 5 months ago

> Hi @WanBenLe, would you mind sharing the exact script for quantization to get llava-v1.6-34b-hf-awq? @casper-hansen I'm also wondering if the PR will be merged soon

For the unmerged version (AutoAWQ==0.2.5), the code and an example for llava-next support are here: https://github.com/WanBenLe/AutoAWQ-with-llava-v1.6/blob/main/examples/llavanext.py

If you plan to use multimodal data (AutoAWQ==0.2.4): https://github.com/WanBenLe/AutoAWQ-with-quantizer/blob/main/examples/multimodal_inputs_prepare.py https://github.com/WanBenLe/AutoAWQ-with-quantizer/tree/main/examples/multimodal_quant_test.py

One caveat: using the default calibration data setting may raise a NaN loss error (AutoAWQ/tree/main/awq/quantize.py, line 343); a sketch of passing your own calibration data follows.
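
A minimal sketch of supplying custom calibration texts instead of the default dataset (the texts below are placeholders; calib_data accepting a list of strings matches AutoAWQ 0.2.x, but verify against your installed version):

    # Hedged sketch: pass in-domain calibration texts if the default
    # calibration set produces NaN losses during quantization.
    calib_texts = [
        "A representative in-domain sentence used for calibration.",
        "Another sample; a few hundred such texts is typical.",
    ]
    model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)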

1SingleFeng commented 4 months ago

Hi, does autoawq 0.2.5 support llava 1.5? Could you share some example code? And what is the minimum required transformers version?

WanBenLe commented 4 months ago

> Hi, does autoawq 0.2.5 support llava 1.5? Could you share some example code? And what is the minimum required transformers version?

Why not try the official example plus my code directly? It should run. Here are two links: the original llava-v1.5 PR https://github.com/casper-hansen/AutoAWQ/pull/250 and the new PR awaiting merge https://github.com/casper-hansen/AutoAWQ/pull/471

1SingleFeng commented 4 months ago

> Hi, does autoawq 0.2.5 support llava 1.5? Could you share some example code? And what is the minimum required transformers version?

> Why not try the official example plus my code directly? It should run. Here are two links: the original llava-v1.5 PR #250 and the new PR awaiting merge #471

OK, thank you very much. I'll give it a try.

1SingleFeng commented 4 months ago

@WanBenLe Hi, while trying AutoAWQ-with-llava-v1.6 I ran into the following problem. Do you know how to solve it?

    Traceback (most recent call last):
      File "/home/common/singlefeng/AIGC_TRAIN/AutoAWQ-with-llava-v1.6_20240624/quantize_llava.py", line 22, in <module>
        model.quantize(tokenizer, quant_config=quant_config)
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "/home/common/singlefeng/AIGC_TRAIN/AutoAWQ-with-llava-v1.6_20240624/awq/models/base.py", line 181, in quantize
        self.quantizer = AwqQuantizer(
      File "/home/common/singlefeng/AIGC_TRAIN/AutoAWQ-with-llava-v1.6_20240624/awq/quantize/quantizer.py", line 61, in __init__
        self.modules, self.module_kwargs, self.inps = self.init_quant()
      File "/home/common/singlefeng/AIGC_TRAIN/AutoAWQ-with-llava-v1.6_20240624/awq/quantize/quantizer.py", line 482, in init_quant
        self.model(samples.to(next(self.model.parameters()).device))
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 420, in forward
        inputs_embeds = self.get_input_embeddings()(input_ids)
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
        return self._call_impl(*args, **kwargs)
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
        return forward_call(*args, **kwargs)
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
        return F.embedding(
      File "/home/common/anaconda3/envs/auto_awq/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

My quantization code is as follows:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = 'llava-1.5-7b-hf'
    quant_path = 'llava-1.5-7b-hf-awq'

    quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)

    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)

    print(f'Model is quantized and saved at "{quant_path}"')

WanBenLe commented 4 months ago

@1SingleFeng Check whether your model or your data is missing a .to('cuda') somewhere; if you don't plan to use CUDA, you can set the OS CUDA device variable to empty (see the sketch below).
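
A sketch of both options (the device_map kwarg being forwarded to transformers is an assumption about recent AutoAWQ versions; check your installed version):

    import os

    # Option 1: hide all GPUs so every tensor stays on CPU. This must run
    # before torch/awq are imported, otherwise CUDA is already initialized.
    os.environ["CUDA_VISIBLE_DEVICES"] = ""

    # Option 2 (assumption: your AutoAWQ version forwards device_map to
    # transformers): load the whole model onto one GPU so the embedding
    # weights and the calibration samples land on the same device.
    # model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda:0")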