0cc4m / KoboldAI

GNU Affero General Public License v3.0

Can't Generate With 4bit Quantized Model #19

Closed chigkim closed 1 year ago

chigkim commented 1 year ago

I cloned the latestgptq branch with the --recurse-submodules flag:

git clone https://github.com/0cc4m/KoboldAI -b latestgptq --recurse-submodules

I quantized a model using the GPTQ submodule inside repos/:

python llama.py models/test c4 --wbits 4 --true-sequential --act-order --save_safetensors models/test/4bit.safetensors

I can run inference manually:

python repos/gptq/llama_inference.py models/test --wbits 4 --load models/test/4bit.safetensors --text "Once upon a time, "

It also looks like the model loads fine in KoboldAI.
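(Not part of the original steps, just an optional sanity check: you can confirm the GPTQ submodule actually came down with the clone.)

cd KoboldAI
# Each submodule should be listed with a commit hash and no leading '-'
# (a leading '-' would mean the submodule was never initialized)
git submodule status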

loading Model
INIT       | Searching  | GPU support
WARNING    | __main__:load_model:2882 - This model does not support hybrid generation. --breakmodel_gpulayers will be ignored.
INIT       | Found      | GPU support
INIT       | Starting   | Transformers
4-bit CPU offloader active
Using 4-bit file: /content/KoboldAI/models/test/4bit.safetensors, groupsize -1
Trying to load llama model in 4-bit
Loading model ...
/usr/local/lib/python3.10/dist-packages/safetensors/torch.py:99: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  with safe_open(filename, framework="pt", device=device) as f:
/usr/local/lib/python3.10/dist-packages/torch/_utils.py:776: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  return self.fget.__get__(instance, owner)()
/usr/local/lib/python3.10/dist-packages/torch/storage.py:899: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  storage = cls(wrap_storage=untyped_storage)
Done.
INFO       | __main__:load_model:3396 - Pipeline created: test
INIT       | Starting   | LUA bridge
INIT       | OK         | LUA bridge
INIT       | Starting   | LUA Scripts
INIT       | OK         | LUA Scripts
Setting Seed

However, I get an error when I try to submit text from KoboldAI and generate.

ERROR      | __main__:generate:6516 - Traceback (most recent call last):
  File "/content/KoboldAI/aiserver.py", line 6503, in generate
    genout, already_generated = tpool.execute(core_generate, txt, minimum, maximum, found_entries)
  File "/usr/local/lib/python3.10/dist-packages/eventlet/tpool.py", line 132, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python3.10/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.10/dist-packages/eventlet/tpool.py", line 86, in tworker
    rv = meth(*args, **kwargs)
  File "/content/KoboldAI/aiserver.py", line 5682, in core_generate
    result = raw_generate(
  File "/content/KoboldAI/aiserver.py", line 5910, in raw_generate
    batch_encoded = torch_raw_generate(
  File "/content/KoboldAI/aiserver.py", line 6006, in torch_raw_generate
    genout = generator(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1462, in generate
    return self.sample(
  File "/content/KoboldAI/aiserver.py", line 2456, in new_sample
    return new_sample.old_sample(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2478, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/KoboldAI/repos/gptq/offload.py", line 225, in llama_offload_forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: LlamaDecoderLayer.forward() got an unexpected keyword argument 'position_ids'
chigkim commented 1 year ago

If you quantize with the gptq submodule using --groupsize 128 and run inference, you get garbage output. This was fixed in the latest qwopqwop200/GPTQ-for-LLaMa cuda branch: if you quantize with that branch using the --groupsize 128 flag, you don't get garbage during inference. Could you update the submodule and integrate the fix? Thanks!

0cc4m commented 1 year ago

The error TypeError: LlamaDecoderLayer.forward() got an unexpected keyword argument 'position_ids' is most likely caused by a wrong transformers version.

You only get garbage outputs if you use groupsize and act-order together, which is mentioned in the readme of my GPTQ fork. I will not update it yet, as I had very bad performance with upstream GPTQ the last time I tested it. We are at a state here that works well and is fast, while qwopqwop200 is focused on better perplexity results, which aren't that important to KoboldAI users, even at the cost of performance or compatibility.
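As an illustration of that flag combination, a quantization call like the one in the first post, but with a group size and without --act-order, would look roughly like this (the output filename here is only an example):

# Same quantization call as above, using --groupsize 128 and dropping --act-order
python llama.py models/test c4 --wbits 4 --true-sequential --groupsize 128 --save_safetensors models/test/4bit-128g.safetensors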

chigkim commented 1 year ago

I quantized using the gptq that comes as a submodule of 0cc4m/KoboldAI. Is KoboldAI not using that to run inference? I can run repos/gptq/llama_inference.py without the error. I'm also installing transformers==4.28.0, which is specified in requirements.txt. What version should I install instead? Thanks for your help!

0cc4m commented 1 year ago

4.28.0 is correct, and so is using the GPTQ version in repos/gptq. Do you still get that TypeError or something else now?

chigkim commented 1 year ago

Unfortunately it's the same error: TypeError: LlamaDecoderLayer.forward() got an unexpected keyword argument 'position_ids'

0cc4m commented 1 year ago

You must have an outdated transformers version, as LlamaDecoderLayer.forward() does indeed have that parameter.
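A quick way to verify this, assuming the pip environment shown in the logs above:

# Check which transformers version is actually installed; it should report 4.28.0
pip show transformers
# If it is older, reinstall the pinned requirements
pip install -r requirements.txt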

chigkim commented 1 year ago

Yep, my bad! That was it! After I pulled today, I didn't run pip install -r requirements.txt. By the way, is there a flag I need to pass to aiserver.py in order to enable the API so TavernAI can connect? I can open the link from the browser, but I can't seem to use the same link to connect from TavernAI. Also, is there a way to load a 4-bit model on Colab without using the UI? I played with --model and --path, but no luck. My model is at models/test/4bit.safetensors. Thanks so much for your help!

chigkim commented 1 year ago

Actually, I got the API to work. I just needed to use the first link, not the second one with /new_ui, and then add /api. Now I just need to figure out how to automatically load the 4-bit model when aiserver.py starts, without using the UI. I'd appreciate your help!
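To spell that out with an assumed example address (the actual host and port come from whatever aiserver.py prints at startup):

# If the first link printed at startup were, say, http://127.0.0.1:5000
# then the URL to enter in TavernAI would be that link plus /api:
#   http://127.0.0.1:5000/api
curl http://127.0.0.1:5000/api/v1/model   # optional check; assumes the standard KoboldAI API route is available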

0cc4m commented 1 year ago

I don't think that works in the latestgptq branch yet, but it might work in the model-structure-update branch if you run it with --model modelname, where modelname is the name of the folder in models/.
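With the folder layout from earlier in this thread, that would be something like:

# 'test' is the folder name under models/, not the path to the .safetensors file
python aiserver.py --model test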

chigkim commented 1 year ago

Thanks for the info.

0cc4m commented 1 year ago

Should work in the latestgptq branch now, too.

chigkim commented 1 year ago

Thanks, but now it gives me an error: No module named 'hf_bleeding_edge'. !pip install hf_bleeding_edge doesn't work either. What module is that, and where can I get it? It's not in requirements.txt?

0cc4m commented 1 year ago

The reliable way to check which packages you need is the environments/huggingface.yml file, not requirements.txt. In this case, you need https://github.com/0cc4m/hf_bleeding_edge, which can be installed with pip.
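One way to do that, assuming pip's git support is available in the environment:

# Install the package directly from the repository linked above
pip install git+https://github.com/0cc4m/hf_bleeding_edge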