catid closed this issue 1 year ago
GPTQ-triton is not currently compatible with transformers HEAD. You'll need to use a transformers commit before 7dcd870. For example:
pip install git+https://github.com/huggingface/transformers.git@5506d0496957cde19318eee3d34ee682b654abe8
I've been holding off on compatibility because there's a performance regression in transformers after commit 7dcd870.
Sorry for the rough edge here.
Looks like you mean that I should run the llama.py GPTQ script again after changing to this version of transformers, since that doesn't fix the convert_weights.py script. It's crunching...
Right, yes. Let me know if that fixes things.
This seems to emit a different error:
(gptq) ➜ GPTQ-for-LLaMa git:(triton) ✗ python llama.py ~/models/llama-hf c4 --wbits 4 --true-sequential --act-order --groupsize -1 --save_safetensors llama65b-4bit-nog-old-transformers.safetensors
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 14/14 [01:53<00:00, 8.12s/it]
Found cached dataset json (/home/catid/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Found cached dataset json (/home/catid/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Starting ...
Traceback (most recent call last):
File "/home/catid/sources/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 457, in <module>
quantizers = llama_sequential(model, dataloader, DEV)
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/catid/sources/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 54, in llama_sequential
model(batch[0].to(dev))
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
outputs = self.model(
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
layer_outputs = decoder_layer(
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/catid/sources/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 49, in forward
cache['position_ids'] = kwargs['position_ids']
KeyError: 'position_ids'
Should I be on a specific branch of GPTQ-for-LLaMa? Currently using the triton branch. Maybe this script should be in your repo so you can version it in a compatible way?
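For context on the traceback above: the failing line reads kwargs['position_ids'] unconditionally, and the KeyError suggests the pinned transformers commit does not pass that argument to the decoder layer. A minimal sketch of a more tolerant capture follows. The helper name capture_position_ids is hypothetical and not part of GPTQ-for-LLaMa, and later parts of the quantizer may still need a real position_ids value, so this only illustrates the cause rather than a confirmed fix.

```python
def capture_position_ids(cache, kwargs):
    """Record position_ids for later replay, tolerating its absence.

    Hypothetical helper: dict.get stores None when the caller (for example,
    an older transformers build) did not pass position_ids, instead of
    raising KeyError as kwargs['position_ids'] would.
    """
    cache['position_ids'] = kwargs.get('position_ids')
    return cache


# Simulated call from a transformers build that omits position_ids.
old_style = capture_position_ids({}, {'attention_mask': 'mask'})

# Simulated call from a build that passes position_ids through.
new_style = capture_position_ids({}, {'attention_mask': 'mask', 'position_ids': [0, 1, 2]})

print(old_style['position_ids'], new_style['position_ids'])
```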
Hmm, I guess GPTQ-for-LLaMa doesn't like the older transformers version. Quite the pickle. Yeah, I've been meaning to absorb the quantizer script.
I'll get a better fix out, but that'll be later. I'm right in the middle of something.
Commit 19c0535d792d7e388e7fe799f8cfa350ce74fa9a of GPTQ-for-LLaMa should work; that's what I have on my machine at the moment.
Or you can try to patch convert_weights.py. Then you shouldn't need to re-quantize the weights. Off the top of my head this might work:
Replace this entire block:
if state_dict[name + '.bias'] is not None:
    print(f"Converting bias for {name}")
    state_dict[name + '.bias'] = state_dict[name + '.bias'].to(torch.float16)
with just:
state_dict[name + '.bias'] = None
Thanks for the suggestion, but it looks like it needs more surgery:
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/safetensors/torch.py", line 222, in _flatten
raise ValueError(f"Key `{k}` is invalid, expected torch.Tensor but received {type(v)}")
ValueError: Key `model.layers.0.self_attn.q_proj.bias` is invalid, expected torch.Tensor but received <class 'NoneType'>
Going to try running GPTQ again with the older branch of their repo.
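The ValueError above shows that safetensors rejects None values when saving. One possible workaround, sketched here under assumptions, is to drop such entries from the state dict before saving rather than storing None. The helper drop_none_entries is hypothetical, the values below are plain-Python stand-ins for tensors, and the thread does not confirm that the loader accepts a checkpoint with the bias keys missing.

```python
def drop_none_entries(state_dict):
    """Return a copy of state_dict without entries whose value is None.

    Hypothetical helper: safetensors only serializes tensors, so None-valued
    keys must be removed before the dict is handed to its save function.
    """
    return {key: value for key, value in state_dict.items() if value is not None}


# Stand-in values; in the real script these would be torch tensors.
state_dict = {
    'model.layers.0.self_attn.q_proj.qweight': [1, 2, 3],
    'model.layers.0.self_attn.q_proj.bias': None,
}

cleaned = drop_none_entries(state_dict)
print(sorted(cleaned))
```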
This worked, now just trying to figure out how to get it to run on two GPUs...
Nice to hear it's working now. I haven't tried multi-GPU setups yet. At the very least you'll need to do load_quant(filepath, device=None, warmup_autotune=False) when loading the model, otherwise it'll try to move the model to the GPU itself (1). Then you're welcome to divide the model up for multi-GPU however you'd like. At least text-generation-webui does this:
max_memory = accelerate.utils.get_balanced_memory(model)
device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
print("Using the following device map for the quantized model:", device_map)
# https://huggingface.co/docs/accelerate/package_reference/big_modeling#accelerate.dispatch_model
model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)
(1) FYI: This disables the warmup_autotune during loading, which won't cause problems, but will cause inference speed to initially lag since it's doing autotuning on the fly.
The autotune was an issue so it's nice to be able to disable it and not worry about fixing it.
Using your favorite branch of transformers, I get this error inside fused_attention.py:
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/gptq_triton/fused_attention.py", line 103, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, offset=offset)
File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 132, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Specifically:
query_states.device = cuda:0
key_states.device = cuda:0
cos.device = cuda:1
sin.device = cuda:1
I patched that part, but then it fails inside the transformers library, so I upgraded to the latest version of transformers and hit TypeError: forward() got an unexpected keyword argument 'position_ids'
Seems like a bridge too far, so I'm going to use the older GPTQ-for-LLaMa triton codebase rather than the bleeding edge for my project.
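The patch itself is not shown in the thread. One plausible form, sketched here as an assumption, is to move cos and sin onto the device of query_states before the rotary embedding is applied. The helper move_to_device_of is hypothetical, and FakeTensor is only a stand-in so the sketch runs without torch or a GPU; with real tensors the same .device and .to() calls apply. As reported above, a fix like this was not enough on its own, since the next failure came from inside transformers.

```python
def move_to_device_of(reference, *tensors):
    """Return the given tensors, each moved to reference.device if needed."""
    return tuple(
        t if t.device == reference.device else t.to(reference.device)
        for t in tensors
    )


class FakeTensor:
    """Minimal stand-in for torch.Tensor (only .device and .to are modeled)."""

    def __init__(self, device):
        self.device = device

    def to(self, device):
        return FakeTensor(device)


# Mirror the devices reported in the error: query on cuda:0, cos/sin on cuda:1.
query_states = FakeTensor('cuda:0')
cos, sin = FakeTensor('cuda:1'), FakeTensor('cuda:1')

cos, sin = move_to_device_of(query_states, cos, sin)
print(cos.device, sin.device)  # cuda:0 cuda:0
```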
Thank you for giving the project a try and giving valuable feedback. I'll take a look at the multi-GPU stuff when I get a chance.
FYI:
As of the latest commit (3daf413123cd55c600966492775f717eb3ac01e0) nightly transformers is supported again, so no need to futz around with weird versions. I've also added quantized.py to this repo, so there's no dependency on GPTQ-for-LLaMa to do the quantization, and no need for the conversion script. Multi-GPU is still in progress.
Trying to figure out how to use your weight conversion script. This is what I'm trying:
(1) Using GPTQ-for-LLaMa repo to convert LLaMA 65B model to .safetensors file:
$ python llama.py ~/models/llama-hf c4 --wbits 4 --true-sequential --act-order --groupsize -1 --save_safetensors ~/models/llama-hf/llama65b-4bit-nog.safetensors
(2) Using GPTQ-triton repo to convert .safetensors file to compatible version:
$ python convert_weights.py --model ~/models/llama-hf --quant ~/models/llama-hf/llama65b-4bit-nog.safetensors --output ~/models/gptqt_model
Getting this error: