efeslab / Atom

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

RuntimeError when quant llama model #12

Closed · ghost closed 5 months ago

ghost commented 5 months ago

Hi, when I tried to quantize the llama model, I got the following error:

Traceback (most recent call last):
  File "/workspace/code/atom-main/model/main.py", line 205, in <module>
    act_scales = get_act_stats_func(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/yangshangtong/code/atom-main/model/outlier.py", line 95, in get_act_stats_llama
    outs[j] = layer(inps[j].unsqueeze(0), attention_mask=attention_mask, position_ids=position_ids)[0]
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 750, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1510, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1519, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 681, in forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
RuntimeError: The expanded size of the tensor (2048) must match the existing size (32768) at non-singleton dimension 3.  Target sizes: [1, 52, 2048, 2048].  Tensor sizes: [1, 1, 32768, 32768]

The command is:

python model/main.py /workspace/model/llama2-7B wikitext2 \
    --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
    --reorder --act_sort_metric hessian \
    --a_clip_ratio 0.9 --w_clip_ratio 0.85 \
    --keeper 128 --keeper_precision 3 --kv_cache \
    --eval_ppl

What should I do to solve this problem? Looking forward to your reply.

happierpig commented 5 months ago

Hi @mxjyst ,

Thanks for reporting the issue. How did you set up the environment? Could you please let me know your PyTorch and Transformers versions?

ghost commented 5 months ago

Sure.

torch==2.2.0a0+81ea7a4 transformers==4.39.0

happierpig commented 5 months ago

Also, it seems like a mismatch between our layout and the Transformers library API. So could you please try the following command, which will set a quantization bit width and therefore use our own wrapper for LlamaDecoderLayer?

python model/llama.py /Path/To/Llama/Model wikitext2 \
    --wbits 4 --abits 4 --a_sym --w_sym \
    --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
    --reorder --act_sort_metric hessian \
    --a_clip_ratio 0.9 --w_clip_ratio 0.85 \
    --keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
    --eval_ppl --eval_common_sense

happierpig commented 5 months ago

By the way, the library versions you are using differ from the ones pinned in our codebase (see: https://github.com/efeslab/Atom/blob/main/model/requirements.txt). So the other thing you can try is following the README.md to set up exactly the environment we use.
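
As a quick sanity check before rerunning, the snippet below (a minimal sketch, not part of the Atom codebase; it assumes model/requirements.txt uses plain name==version pins) compares your installed package versions against the pinned ones:

# Minimal sketch (not part of the Atom codebase): compare installed package
# versions against the pins in model/requirements.txt. Assumes the file uses
# simple "name==version" lines; anything else is skipped.
from importlib.metadata import version, PackageNotFoundError

def check_requirements(path="model/requirements.txt"):
    for line in open(path):
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        try:
            installed = version(name)
        except PackageNotFoundError:
            print(f"{name}: NOT INSTALLED (pinned {pinned})")
            continue
        status = "OK" if installed == pinned else "MISMATCH"
        print(f"{name}: installed {installed}, pinned {pinned} -> {status}")

if __name__ == "__main__":
    check_requirements()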

ghost commented 5 months ago

> Also, it seems like a mismatch between our layout and the Transformers library API. So could you please try the following command, which will set a quantization bit width and therefore use our own wrapper for LlamaDecoderLayer?
>
> python model/llama.py /Path/To/Llama/Model wikitext2 --wbits 4 --abits 4 --a_sym --w_sym --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 --reorder --act_sort_metric hessian --a_clip_ratio 0.9 --w_clip_ratio 0.85 --keeper 128 --keeper_precision 3 --kv_cache --use_gptq --eval_ppl --eval_common_sense

There is no llama.py under model.

happierpig commented 5 months ago

> > Also, it seems like a mismatch between our layout and the Transformers library API. So could you please try the following command, which will set a quantization bit width and therefore use our own wrapper for LlamaDecoderLayer?
> >
> > python model/llama.py /Path/To/Llama/Model wikitext2 --wbits 4 --abits 4 --a_sym --w_sym --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 --reorder --act_sort_metric hessian --a_clip_ratio 0.9 --w_clip_ratio 0.85 --keeper 128 --keeper_precision 3 --kv_cache --use_gptq --eval_ppl --eval_common_sense
>
> There is no llama.py under model.

Sorry for messing up. It's main.py. Basically, follow the reproduction script: https://github.com/efeslab/Atom/blob/main/scripts/run_atom_ppl.sh.

ghost commented 5 months ago

It works after I switched to transformers==4.36.2. Thanks for your help.

By the way, is there a common eval API/code/benchmark I can use to compare different quantization methods?

happierpig commented 5 months ago

I believe the benchmarks in our paper, such as WikiText-2, C4, and the zero-shot tasks, are widely used in quantization research, so you can directly compare our results with the numbers reported for other methods.

ghost commented 5 months ago

Can I use the following code to evaluate the quantized model on all of the tasks supported by lm-eval, including WikiText-2?

if args.eval_common_sense:
    lm = LMClass(args, model)
    lm.seqlen = 2048
    lm.model.eval()
    for param in lm.model.parameters():
        param.requires_grad = False

    if args.multigpu:
        if "llama" in args.model.lower():
            map_layers_to_multi_gpus(lm.model.model.layers)
            input_device = lm.model.model.layers[0].device
            output_device = lm.model.model.layers[-1].device
            assert input_device == output_device
            lm._device = input_device
            lm.model.model.embed_tokens.to(input_device)
            lm.model.model.norm.to(output_device)
            lm.model.lm_head.to(output_device)
        elif "opt" in args.model.lower():
            map_layers_to_multi_gpus(lm.model.model.decoder.layers)
            input_device = lm.model.model.decoder.layers[0].device
            output_device = lm.model.model.decoder.layers[-1].device
            assert input_device == output_device
            lm._device = input_device
            lm.model.model.decoder.embed_tokens.to(input_device)
            lm.model.model.decoder.embed_positions.to(input_device)
            lm.model.model.decoder.final_layer_norm.to(input_device)
            lm.model.lm_head.to(output_device)
    else:
        lm._device = DEV
        lm.model = lm.model.to(lm.device)

    results = {}
    tasks_str = "piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande"
    task_names = pattern_match(tasks_str.split(","), lm_tasks.ALL_TASKS)
    print(f"Selected Tasks: {task_names}")

    task_dict = lm_tasks.get_task_dict(task_names)
    t_results = lm_evaluator.evaluate(
        lm,
        task_dict,
        num_fewshot=args.lm_eval_num_fewshot,
        limit=None if args.lm_eval_limit == -1 else args.lm_eval_limit
    )
    results.update(t_results)
    pprint(results)

    results_dict = results['results']
    for task_name in tasks_str.split(','):
        if task_name in ['piqa', 'arc_easy', 'arc_challenge', 'hellaswag']:
            print(f"INFO {task_name} : {results_dict[task_name]['acc_norm']*100:.2f}")
        else:
            print(f"INFO {task_name} : {results_dict[task_name]['acc']*100:.2f}")

cylinbao commented 5 months ago

@mxjyst I don't think wikitext2 is included in lm_eval. Perplexity and zero-shot accuracy are two separate evaluations: we compute perplexity directly in our codebase, and we leverage the lm_eval library for zero-shot accuracy. I hope this helps!
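
For readers who want a perplexity number outside the Atom scripts, below is a minimal WikiText-2 sketch in the style commonly used by quantization repos. This is not Atom's exact evaluation code; model_path is a placeholder, and it assumes the Hugging Face transformers and datasets packages are installed.

# Minimal WikiText-2 perplexity sketch (NOT Atom's exact evaluation code).
# model_path is a placeholder; transformers and datasets are assumed installed.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/workspace/model/llama2-7B"  # placeholder checkpoint path
seqlen = 2048

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16
).to("cuda")
model.eval()

# Concatenate the test split into one token stream and score it in
# non-overlapping seqlen-sized chunks.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
nsamples = ids.shape[1] // seqlen

nlls = []
with torch.no_grad():
    for i in range(nsamples):
        batch = ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
        # out.loss is the mean token NLL; scale back to a per-chunk sum.
        out = model(batch, labels=batch)
        nlls.append(out.loss.float() * seqlen)

ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen))
print(f"WikiText-2 perplexity: {ppl.item():.2f}")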