mcpaulgeorge opened 1 month ago
Hitting the same issue with --multigpu and even without it.
Hi there,

I changed LMClass.py to

```python
self.model = AutoModelForCausalLM.from_pretrained(args.model, config=config, device_map='auto', torch_dtype=torch.float16)
self._device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

and that fixed this problem.
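In case the placement is unclear, here is a minimal sketch of how those two lines sit together; `PatchedLMClass` and `model_name` are illustrative stand-ins, not OmniQuant's actual class, which does much more:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

class PatchedLMClass:
    """Hypothetical stand-in for OmniQuant's LMClass; only the patched lines are shown."""

    def __init__(self, model_name: str):
        config = AutoConfig.from_pretrained(model_name)
        # device_map='auto' lets accelerate shard the fp16 weights across
        # all visible GPUs, so no manual .to(device) call is needed.
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name, config=config, device_map='auto',
            torch_dtype=torch.float16,
        )
        # Pin the primary device to cuda:0 rather than a bare 'cuda',
        # so calibration tensors land on the first shard.
        self._device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```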
Then change 'models/int_llamalayer.py' to

```python
cos, sin = self.rotary_emb(value_states, position_ids=position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
```
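For anyone wondering why this edit is needed: newer transformers versions moved position handling into the rotary embedding itself. A hedged before/after sketch follows; the "before" lines are my assumption about the pre-patch code, based on the older transformers LLaMA API:

```python
# Before (older transformers LLaMA API, assumed original code):
# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
# query_states, key_states = apply_rotary_pos_emb(
#     query_states, key_states, cos, sin, position_ids)

# After (transformers 4.44): rotary_emb takes position_ids directly and
# returns cos/sin already gathered per position, so apply_rotary_pos_emb
# no longer needs a position_ids argument.
cos, sin = self.rotary_emb(value_states, position_ids=position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
```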
Finally, in quantize/omniquant.py, change the cache to

```python
cache = {"i": 0, "attention_mask": None}
```

and the Catcher class to

```python
class Catcher(nn.Module):
    def __init__(self, module):
        super().__init__()
        self.module = module
        self.is_llama = False

    def forward(self, inp, **kwargs):
        inps[cache["i"]] = inp
        cache["i"] += 1
        # cache["attention_mask"] = kwargs["attention_mask"]
        if self.is_llama:
            cache["position_ids"] = kwargs["position_ids"]
        raise ValueError
```

Hope it can be helpful to you. My transformers version is 4.44.2.
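For context on how the Catcher is used: quantize/omniquant.py wraps the first decoder layer with it to capture calibration inputs, roughly like the sketch below (a minimal version of the standard capture pattern; names such as `dataloader` and `dev` are illustrative, not verbatim from the file):

```python
# Wrap the first decoder layer so its inputs are recorded, run the
# calibration batches, then unwrap it. The deliberate `raise ValueError`
# in Catcher.forward aborts each forward pass as soon as the layer
# input has been captured, so the rest of the model never runs.
layers[0] = Catcher(layers[0])
for batch in dataloader:
    try:
        model(batch[0].to(dev))
    except ValueError:
        pass
layers[0] = layers[0].module  # unwrap after capture
```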
Another workaround: don't use --multigpu, and change the per-layer loop so each layer's inputs follow it to the device that accelerate assigned:

```python
hf_device_map = model.hf_device_map
print(hf_device_map)
for i in range(len(layers)):
    logger.info(f"=== Start quantize layer {i} ===")
    print(f'================={i}==================')
    hf_device = f"cuda:{hf_device_map[f'{layer_name_prefix}.{i}']}"
    layer = layers[i].to(hf_device)
    inps = inps.to(hf_device)
    position_ids = position_ids.to(hf_device)
```
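For reference, `hf_device_map` maps submodule names to device indices. With device_map='auto' on three GPUs it looks something like this (illustrative values; the actual split depends on each GPU's free memory):

```python
{
    'model.embed_tokens': 0,
    'model.layers.0': 0,
    # ... intermediate layers spread across GPUs 0-2 ...
    'model.layers.31': 2,
    'model.norm': 2,
    'lm_head': 2,
}
```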
If you don't comment out cache["attention_mask"] = kwargs["attention_mask"], you get:

```
ValueError: Attention mask should be of size (1, 1, 2048, 2048), but is torch.Size([1, 1, 2048, 2049])
```

The cached mask is one column wider than the 2048-token calibration window (presumably from how transformers 4.44 prepares the 4D causal mask), so replaying it into each layer no longer matches. The server has three A10s (24 GB each); I didn't add --multigpu.