casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Quantizing a model reports an error: RuntimeError: Expected all tensors to be on the same device #558

Open · ShelterWFF opened this issue 3 months ago

ShelterWFF commented 3 months ago

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import AwqConfig, AutoConfig
import torch

model_path = ''
quant_path = ''
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    # trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_cache=False,
    # device_map='cuda:0',
    # torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
ShelterWFF commented 3 months ago

Reduce the transformers version.
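
For example (4.42.4 is the version mentioned later in this thread; note that later comments report it reintroduces a rope_scaling issue for Llama 3.1 models):

pip install transformers==4.42.4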

casper-hansen commented 3 months ago

The default loading of the model in transformers seems to have changed recently. For now, you can just use device_map when needed.
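
For example, a minimal variant of the reproduction script above that pins the model to one GPU (the only change is the explicit device_map):

# Load the model onto a single GPU explicitly rather than relying on the
# default placement chosen by transformers.
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map="cuda:0",
)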

FoolMark commented 3 months ago

Similar issue with following environments:

transformers 4.42.4
AutoAWQ 0.2.6+cu118 
AutoAWQ_Kernels 0.0.6+cu118

Loading with device_map="auto"

model = AutoAWQForCausalLM.from_pretrained(config.model_path, device_map="auto", safetensors=True)

raises

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

The error is solved by specifying a single device explicitly. But what if the model is larger than 80 GB (e.g. Qwen2-72B)?
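
For a model that doesn't fit on one GPU, a possible workaround (untested here, and it may still hit the same cross-device error during quantization) is to constrain placement with max_memory, a transformers from_pretrained argument that AutoAWQ is assumed to forward through its extra keyword arguments; the per-device limits below are illustrative placeholders:

# Untested sketch: spread the weights across GPUs (and optionally CPU) under
# explicit per-device memory caps instead of letting "auto" placement decide
# on its own; the limits are placeholders, not tested values.
model = AutoAWQForCausalLM.from_pretrained(
    config.model_path,
    device_map="auto",
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "100GiB"},
    safetensors=True,
)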

billvsme commented 3 months ago

To convert meta-llama/Meta-Llama-3.1-70B-Instruct, transformers must be upgraded to 4.43.x. When I use 4.43.3, I get the same error.

seolhokim commented 2 months ago

@billvsme I'm using meta-llama/Meta-Llama-3.1-70B-Instruct and I get the same error even though I tried transformers==4.43.3 and 4.44.0. Do I need to specify my entire environment?

supa-thibaud commented 2 months ago

Same issue. @r4dm's solution doesn't work for me, as I'm trying to quantize a fine-tuned Llama 3.1 model.

William-Wildridge commented 2 months ago

Unfortunately, simply installing transformers==4.42.4 doesn't work for Llama 3.1, as this reintroduces an issue with rope_scaling:

ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

Setting device_map="auto" when loading the model unfortunately doesn't work with the latest transformers either.

davedgd commented 2 months ago

For anyone watching this, consider also tracking this issue in transformers: #32420

bkutasi commented 1 month ago

Same issue, but if you have enough VRAM or multiple GPUs, you can set device_map="auto" and it should work. CPU+GPU quantization for Llama 3.1 is still broken as far as I know.

davedgd commented 1 month ago

I have a potential fix that may remedy both the "two devices" error and the rope_scaling issue (by way of allowing for a newer transformers version). Feel free to try out the patch here:

https://github.com/davedgd/transformers/tree/patch-1

e.g.,

pip install git+https://github.com/davedgd/transformers@patch-1