Hi @mxjyst ,
Thanks for reporting the issue. How did you set up the environment? Could you please let me know your PyTorch and Transformers versions?
Sure.
torch==2.2.0a0+81ea7a4 transformers==4.39.0
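For reference, I read them off the standard version attributes:

import torch
import transformers

# Print the installed versions as reported by each package.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)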
Also, it seems like there is a mismatch between our layout and the Transformers library API. Could you please try the following command? It sets a quantization bit width and therefore uses our own wrapper for LlamaDecodeLayer:
python model/llama.py /Path/To/Llama/Model wikitext2 \
    --wbits 4 --abits 4 --a_sym --w_sym \
    --act_group_size 128 --weight_group_size 128 --weight_channel_group 2 \
    --reorder --act_sort_metric hessian \
    --a_clip_ratio 0.9 --w_clip_ratio 0.85 \
    --keeper 128 --keeper_precision 3 --kv_cache --use_gptq \
    --eval_ppl --eval_common_sense
By the way, the library versions you are using differ from our codebase (see: https://github.com/efeslab/Atom/blob/main/model/requirements.txt). The other thing you can try is following the README.md to set up the exact environment we use.
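To clarify what that flag path does: with a bit width set, the decoder layers go through our quantization wrapper instead of the stock Transformers forward, which side-steps the API mismatch. A rough, generic illustration of the idea (hypothetical class and function names, not our actual wrapper):

# Generic illustration of the "wrap each decoder layer" idea.
# NOT Atom's actual code: QuantDecoderLayer is a hypothetical stand-in.
import torch.nn as nn

class QuantDecoderLayer(nn.Module):
    """Delegates to the original layer; a real wrapper would quantize weights/activations."""
    def __init__(self, layer, wbits=4, abits=4):
        super().__init__()
        self.layer = layer
        self.wbits, self.abits = wbits, abits

    def forward(self, hidden_states, **kwargs):
        # Quantization of inputs/weights would happen here, before calling the original forward.
        return self.layer(hidden_states, **kwargs)

def wrap_decoder_layers(model, wbits=4, abits=4):
    # Swap every decoder layer in model.model.layers for the wrapper.
    for i, layer in enumerate(model.model.layers):
        model.model.layers[i] = QuantDecoderLayer(layer, wbits, abits)
    return model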
There is no llama.py under model.
Sorry for messing up. It's main.py. Basically, follow the reproduction script: https://github.com/efeslab/Atom/blob/main/scripts/run_atom_ppl.sh.
It works after I switched to transformers==4.36.2. Thanks for your help.
BTW, is there a common eval API/code/benchmark that I can use in case I want to compare different quantization methods?
I believe the benchmarks in our paper, such as WikiText-2, C4, and the zero-shot tasks, are widely used in quantization research. You can directly compare our results with others' reported numbers.
Can I use the following code to evaluate the quantized model on all the tasks supported by lm-eval, including WikiText-2?
if args.eval_common_sense:
    lm = LMClass(args, model)
    lm.seqlen = 2048
    lm.model.eval()
    for param in lm.model.parameters():
        param.requires_grad = False

    if args.multigpu:
        if "llama" in args.model.lower():
            map_layers_to_multi_gpus(lm.model.model.layers)
            input_device = lm.model.model.layers[0].device
            output_device = lm.model.model.layers[-1].device
            assert input_device == output_device
            lm._device = input_device
            lm.model.model.embed_tokens.to(input_device)
            lm.model.model.norm.to(output_device)
            lm.model.lm_head.to(output_device)
        elif "opt" in args.model.lower():
            map_layers_to_multi_gpus(lm.model.model.decoder.layers)
            input_device = lm.model.model.decoder.layers[0].device
            output_device = lm.model.model.decoder.layers[-1].device
            assert input_device == output_device
            lm._device = input_device
            lm.model.model.decoder.embed_tokens.to(input_device)
            lm.model.model.decoder.embed_positions.to(input_device)
            lm.model.model.decoder.final_layer_norm.to(input_device)
            lm.model.lm_head.to(output_device)
    else:
        lm._device = DEV
        lm.model = lm.model.to(lm.device)

    results = {}
    tasks_str = "piqa,arc_easy,arc_challenge,boolq,hellaswag,winogrande"
    task_names = pattern_match(tasks_str.split(","), lm_tasks.ALL_TASKS)
    print(f"Selected Tasks: {task_names}")
    task_dict = lm_tasks.get_task_dict(task_names)
    t_results = lm_evaluator.evaluate(
        lm,
        task_dict,
        num_fewshot=args.lm_eval_num_fewshot,
        limit=None if args.lm_eval_limit == -1 else args.lm_eval_limit,
    )
    results.update(t_results)
    pprint(results)

    results_dict = results['results']
    for task_name in tasks_str.split(','):
        if task_name in ['piqa', 'arc_easy', 'arc_challenge', 'hellaswag']:
            print(f"INFO {task_name} : {results_dict[task_name]['acc_norm']*100:.2f}")
        else:
            print(f"INFO {task_name} : {results_dict[task_name]['acc']*100:.2f}")
@mxjyst I don't think wikitext2 is included in lm_eval. Perplexity and zero-shot accuracy are two separate sets of evaluations. For perplexity, we compute that directly in our codebase. For zero-shot accuracy, we leverage the lm_eval library to conduct the evaluation. I hope this helps!
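For reference, the usual WikiText-2 perplexity recipe (non-overlapping 2048-token chunks, averaged negative log-likelihood) looks roughly like the sketch below. This is a generic sketch rather than our exact code, and it assumes a Hugging Face causal LM and its tokenizer are already loaded:

# Generic WikiText-2 perplexity sketch (not Atom's exact implementation).
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext2_ppl(model, tokenizer, seqlen=2048, device="cuda"):
    test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    # Tokenize the whole test split as one long sequence.
    ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids.to(device)
    nsamples = ids.numel() // seqlen
    nlls = []
    for i in range(nsamples):
        chunk = ids[:, i * seqlen:(i + 1) * seqlen]
        # With labels == inputs, HF causal LMs return the shifted LM loss.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float() * seqlen)
    # Perplexity = exp(total NLL / total number of predicted tokens).
    return torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen)).item()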
Hi, when I tried to quantize the Llama model, I got the following error:
The command is:
What should I do to solve this problem? Looking forward to your reply.