OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm) #89

Open mcpaulgeorge opened 1 month ago

mcpaulgeorge commented 1 month ago

The server has three A10s (24 GB), and I didn't add --multigpu.

mcpaulgeorge commented 1 month ago

Screenshot 2024-08-05 172010

shubhra commented 1 month ago

Hitting the same issue with --multigpu and even without it.

SSshuishui commented 1 week ago

Hi there, the following changes fixed this problem for me (my transformers version is 4.44.2):

1. In LMClass.py, load the model with `self.model = AutoModelForCausalLM.from_pretrained(args.model, config=config, device_map='auto', torch_dtype=torch.float16)` and set `self._device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")`.
2. In models/int_llama_layer.py, change the rotary embedding call to `cos, sin = self.rotary_emb(value_states, position_ids=position_ids)` followed by `query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)` (a fuller sketch of these first two changes follows the Catcher code below).
3. In quantize/omniquant.py, change `cache = {"i": 0, "attention_mask": None}` and the Catcher class to:

class Catcher(nn.Module):
    def __init__(self, module):
        super().__init__()
        self.module = module
        self.is_llama = False

    def forward(self, inp, **kwargs):
        # Record the inputs of the first decoder layer during calibration.
        inps[cache["i"]] = inp
        cache["i"] += 1
        # Do not cache the attention mask here; see the note about the mask size error below.
        # cache["attention_mask"] = kwargs["attention_mask"]
        if self.is_llama:
            cache["position_ids"] = kwargs["position_ids"]
        raise ValueError  # stop the forward pass once the inputs are captured

Hope it can be helpful to you.
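
For reference, here is a minimal standalone sketch of change 1. The model path is hypothetical (LMClass.py uses args.model), and the surrounding code may differ across OmniQuant versions; printing model.hf_device_map also shows the layer-to-GPU mapping used in the next comment.

import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-hf"   # hypothetical path; LMClass.py uses args.model
config = AutoConfig.from_pretrained(model_path)

# device_map='auto' lets accelerate shard the decoder layers across all visible GPUs,
# so no single 24 GB card has to hold the whole model.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    config=config,
    device_map="auto",
    torch_dtype=torch.float16,
)

# Keep a primary device for inputs/outputs, as in the modified LMClass.py.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Mapping of module names (e.g. 'model.layers.0') to GPU indices, used by the layer loop below.
print(model.hf_device_map)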

SSshuishui commented 4 days ago


In addition to the changes above, do not use --multigpu, and change the layer loop as follows:

hf_device_map = model.hf_device_map
print(hf_device_map)  # module-name -> GPU index mapping produced by device_map='auto'

for i in range(len(layers)):
    logger.info(f"=== Start quantize layer {i} ===")
    print(f'================={i}==================')
    # Move the current layer and its inputs to the GPU that device_map='auto' assigned to it,
    # so all tensors in the per-layer quantization live on the same device.
    hf_device = f"cuda:{hf_device_map[f'{layer_name_prefix}.{i}']}"
    layer = layers[i].to(hf_device)
    inps = inps.to(hf_device)
    position_ids = position_ids.to(hf_device)

If you do not comment out `cache["attention_mask"] = kwargs["attention_mask"]`, you get `ValueError: Attention mask should be of size (1, 1, 2048, 2048), but is torch.Size([1, 1, 2048, 2049])`.
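
If your copy of the loop still needs an explicit mask after commenting that line out, one possible workaround (a sketch, not code from the repo) is to rebuild a causal mask of the expected (1, 1, seq_len, seq_len) shape directly on each layer's device:

import torch

def build_causal_mask(seq_len, device, dtype=torch.float16):
    # Additive causal mask: 0 where attention is allowed, a large negative value for future positions.
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype, device=device)
    mask = torch.triu(mask, diagonal=1)
    return mask[None, None, :, :]  # shape (1, 1, seq_len, seq_len)

# hf_device comes from the loop above; 2048 is the calibration sequence length from the error message.
attention_mask = build_causal_mask(2048, hf_device)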