casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select) #510

Open · xieziyi881 opened this issue 4 months ago

xieziyi881 commented 4 months ago

```python
from awq import AutoAWQForCausalLM
from awq.utils.utils import get_best_device
from transformers import AutoTokenizer, TextStreamer

quant_path = "/workspace/awq_model"

# Load the quantized model on the best available device.
if get_best_device() == "cpu":
    model = AutoAWQForCausalLM.from_quantized(quant_path, use_qbits=True, fuse_layers=False)
else:
    model = AutoAWQForCausalLM.from_quantized(quant_path, fuse_layers=True, device_map="balanced")
tokenizer = AutoTokenizer.from_pretrained(quant_path, trust_remote_code=True)

# Initialize the streamer for streaming output
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

prompt = "You're standing on the surface of the Earth. " \
         "You walk one mile south, one mile west and one mile north. " \
         "You end up exactly where you started. Where are you?"

chat = [
    {"role": "system", "content": "You are a concise assistant that helps answer questions."},
    {"role": "user", "content": prompt},
]

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

tokens = tokenizer.apply_chat_template(chat, return_tensors="pt")
tokens = tokens.to("cuda:0")

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators,
)
```

Here's my script for the quantized model. However, it fails with the error in the title. How can I fix it?
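For context on where the mismatch likely comes from: with `device_map="balanced"`, the model's layers are sharded across all visible GPUs, so the input embedding layer can land on a GPU other than cuda:0 (here cuda:7), while the script pins `tokens` to cuda:0. A minimal sketch of one way to line the two up, assuming the AutoAWQ wrapper exposes the underlying Hugging Face model as `model.model` (so `get_input_embeddings()` is available):

```python
# Sketch: move the input ids to whichever device holds the embedding
# layer, instead of hard-coding "cuda:0". Assumes `model.model` is the
# underlying transformers model wrapped by AutoAWQ.
embed_device = model.model.get_input_embeddings().weight.device
tokens = tokens.to(embed_device)

generation_output = model.generate(
    tokens,
    streamer=streamer,
    max_new_tokens=64,
    eos_token_id=terminators,
)
```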

ryan0980 commented 3 months ago

You can try:

```python
import os
# Must be set before torch initializes CUDA to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '6'

import torch

# With only GPU 6 visible, it is addressed as cuda:0.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)
```

Works for me.
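This works because exposing a single GPU forces both the model and the inputs onto the same device, sidestepping the mismatch entirely; the trade-off is that the quantized model must fit on that one GPU, since multi-GPU sharding is disabled.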