fpgaminer / GPTQ-triton

GPTQ inference Triton kernel

Weight conversion help #8

Closed. catid closed this issue 1 year ago.

catid commented 1 year ago

Trying to figure out how to use your weight conversion script. This is what I'm trying:

(1) Using GPTQ-for-LLaMa repo to convert LLaMA 65B model to .safetensors file:

$ python llama.py ~/models/llama-hf c4 --wbits 4 --true-sequential --act-order --groupsize -1 --save_safetensors ~/models/llama-hf/llama65b-4bit-nog.safetensors

(2) Using GPTQ-triton repo to convert .safetensors file to compatible version:

$ python convert_weights.py --model ~/models/llama-hf --quant ~/models/llama-hf/llama65b-4bit-nog.safetensors --output ~/models/gptqt_model

Getting this error:

/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
Traceback (most recent call last):
  File "/home/catid/sources/GPTQ-triton/convert_weights.py", line 87, in <module>
    main()
  File "/home/catid/sources/GPTQ-triton/convert_weights.py", line 58, in main
    if state_dict[name + '.bias'] is not None:
KeyError: 'model.layers.0.self_attn.q_proj.bias'
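
For reference, here's a quick diagnostic sketch to list which tensor names the quantized .safetensors file actually contains (e.g. whether any '.bias' entries were saved at all); nothing here beyond the path from step (1):

    # List the keys stored in the quantized checkpoint; the KeyError above
    # suggests the attention projections have no '.bias' entries at all.
    import os
    from safetensors import safe_open

    path = os.path.expanduser("~/models/llama-hf/llama65b-4bit-nog.safetensors")  # file from step (1)
    with safe_open(path, framework="pt", device="cpu") as f:
        keys = list(f.keys())

    print(len(keys), "tensors;", sum(k.endswith(".bias") for k in keys), "bias entries")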
fpgaminer commented 1 year ago

GPTQ-triton is not currently compatible with transformers HEAD. You'll need to use a transformers commit before 7dcd870. For example, pip install git+https://github.com/huggingface/transformers.git@5506d0496957cde19318eee3d34ee682b654abe8.

I've been holding off on compatibility because there's a performance regression in transformers after commit 7dcd870.

Sorry for the rough edge here.
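
If it helps, a quick sanity check that the pinned build actually ended up installed (it should report a dev version string rather than a plain release):

    # Check which transformers build is active in the current environment.
    import transformers
    print(transformers.__version__)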

catid commented 1 year ago

Looks like you mean I should re-run the llama.py GPTQ script after switching to that version of transformers, since switching alone doesn't fix the convert_weights.py error. It's crunching...

fpgaminer commented 1 year ago

Right, yes. Let me know if that fixes things.

catid commented 1 year ago

This seems to emit a different error:

(gptq) ➜  GPTQ-for-LLaMa git:(triton) ✗ python llama.py ~/models/llama-hf c4 --wbits 4 --true-sequential --act-order --groupsize -1 --save_safetensors llama65b-4bit-nog-old-transformers.safetensors
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████| 14/14 [01:53<00:00,  8.12s/it]
Found cached dataset json (/home/catid/.cache/huggingface/datasets/allenai___json/allenai--c4-6fbe877195f42de5/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Found cached dataset json (/home/catid/.cache/huggingface/datasets/allenai___json/allenai--c4-efc3d4f4606f44bd/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Starting ...
Traceback (most recent call last):
  File "/home/catid/sources/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 457, in <module>
    quantizers = llama_sequential(model, dataloader, DEV)
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/catid/sources/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 54, in llama_sequential
    model(batch[0].to(dev))
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/catid/sources/text-generation-webui/repositories/GPTQ-for-LLaMa/llama.py", line 49, in forward
    cache['position_ids'] = kwargs['position_ids']
KeyError: 'position_ids'
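
Looking at line 49 of llama.py in the traceback, the Catcher assumes transformers always passes position_ids to the decoder layers, which the older pinned transformers apparently doesn't. An untested workaround might be to make that kwarg optional:

    # Untested sketch: in GPTQ-for-LLaMa's llama.py Catcher.forward (line 49 above),
    # tolerate a missing position_ids kwarg instead of indexing it directly.
    cache['position_ids'] = kwargs.get('position_ids')  # None when not provided
    # Whether the rest of llama.py copes with a None position_ids is a separate question.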
catid commented 1 year ago

Should I be on a specific branch of GPTQ-for-LLaMa? Currently using the triton branch. Maybe this script should be in your repo so you can version it in a compatible way?

fpgaminer commented 1 year ago

Hmm, I guess GPTQ-for-LLaMa doesn't like the older transformers version. Quite the pickle. Yeah, I've been meaning to absorb the quantization script into this repo.

I'll get a better fix out, but that'll be later. I'm right in the middle of something.

Commit 19c0535d792d7e388e7fe799f8cfa350ce74fa9a of GPTQ-for-LLaMa should work; that's what I have on my machine at the moment.

Or you can try to patch convert_weights.py. Then you shouldn't need to re-quantize the weights. Off the top of my head this might work:

Replace this entire block:

        if state_dict[name + '.bias'] is not None:
            print(f"Converting bias for {name}")
            state_dict[name + '.bias'] = state_dict[name + '.bias'].to(torch.float16)

with just:

        state_dict[name + '.bias'] = None
catid commented 1 year ago

Thanks for the suggestion, but it looks like it needs more surgery:

  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/safetensors/torch.py", line 222, in _flatten
    raise ValueError(f"Key `{k}` is invalid, expected torch.Tensor but received {type(v)}")
ValueError: Key `model.layers.0.self_attn.q_proj.bias` is invalid, expected torch.Tensor but received <class 'NoneType'>
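
Maybe the key needs to be dropped entirely rather than stored as None, since safetensors only serializes tensors. Something like this might work instead (untested):

        # Untested alternative to the patch above: remove the missing bias key
        # instead of storing None, since safetensors refuses non-tensor values.
        # (Assumes nothing later in convert_weights.py expects the key to exist.)
        state_dict.pop(name + '.bias', None)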

Going to try running GPTQ again with the older branch of their repo.

catid commented 1 year ago

This worked; now I'm just trying to figure out how to get it to run on two GPUs...

fpgaminer commented 1 year ago

Nice to hear it's working now. I haven't tried multi-GPU setups yet. At the very least you'll need to call load_quant(filepath, device=None, warmup_autotune=False) when loading the model, otherwise it'll try to move the model onto a GPU itself (1). Then you're free to divide the model across GPUs however you'd like; for example, text-generation-webui does this:

    max_memory = accelerate.utils.get_balanced_memory(model)
    device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
    print("Using the following device map for the quantized model:", device_map)
    # https://huggingface.co/docs/accelerate/package_reference/big_modeling#accelerate.dispatch_model
    model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)

(1) FYI: This disables warmup_autotune during loading, which won't cause problems, but inference will initially be slower since autotuning happens on the fly.
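
Roughly, and untested on my end, putting those pieces together might look like this (assuming load_quant can be imported from the gptq_triton package and the model path is just a placeholder):

    import accelerate
    from gptq_triton import load_quant  # assumption: load_quant is exposed here

    model_path = "/path/to/gptqt_model"  # hypothetical path to the converted model

    # device=None keeps the model on the CPU; warmup_autotune=False skips
    # autotuning at load time (see the FYI above).
    model = load_quant(model_path, device=None, warmup_autotune=False)

    # Split the model across the available GPUs, keeping each decoder layer whole.
    max_memory = accelerate.utils.get_balanced_memory(model)
    device_map = accelerate.infer_auto_device_map(model, max_memory=max_memory, no_split_module_classes=["LlamaDecoderLayer"])
    model = accelerate.dispatch_model(model, device_map=device_map, offload_buffers=True)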

catid commented 1 year ago

The autotune was an issue so it's nice to be able to disable it and not worry about fixing it.

Using your favorite branch of transformers, I get this error inside fused_attention.py:

  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/gptq_triton/fused_attention.py", line 103, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, offset=offset)
  File "/home/catid/mambaforge/envs/gptq/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 132, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
catid commented 1 year ago

Specifically:

query_states.device = cuda:0
key_states.device = cuda:0
cos.device = cuda:1
sin.device = cuda:1
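
A band-aid that might work is to move cos and sin onto the query's device right before the rotary embedding in fused_attention.py, something like this (untested):

    # Untested band-aid in gptq_triton/fused_attention.py, just before the
    # apply_rotary_pos_emb call shown in the traceback above:
    cos = cos.to(query_states.device)
    sin = sin.to(query_states.device)
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, offset=offset)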
catid commented 1 year ago

I patched that part, and then it failed inside the transformers library, so I upgraded to the latest transformers and hit TypeError: forward() got an unexpected keyword argument 'position_ids'.

Seems like a bridge too far, so I'm going to use the older GPTQ-for-LLaMa triton codebase rather than the bleeding edge for my project.

fpgaminer commented 1 year ago

Thank you for giving the project a try and for the valuable feedback. I'll take a look at the multi-GPU stuff when I get a chance.

fpgaminer commented 1 year ago

FYI:

As of the latest commit (3daf413123cd55c600966492775f717eb3ac01e0), nightly transformers is supported again, so there's no need to futz around with weird versions. I've also added quantized.py to this repo, so there's no dependency on GPTQ-for-LLaMa for quantization and no need for the conversion script. Multi-GPU support is still in progress.