chigkim closed this issue 1 year ago
If you quantize with the gptq submodule using --groupsize 128 and run inference, you get garbage output. This was fixed in the latest qwopqwop200/GPTQ-for-LLaMa cuda branch: if you quantize with that branch using the --groupsize 128 flag, you don't get garbage during inference. Could you update the submodule and integrate the fix? Thanks!
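For reference, a quantization command along these lines reproduces it (the model path is a placeholder, and I'm assuming llama.py lives under repos/gptq as elsewhere in this setup):
python repos/gptq/llama.py models/test c4 --wbits 4 --groupsize 128 --save_safetensors models/test/4bit-128g.safetensors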
The error TypeError: LlamaDecoderLayer.forward() got an unexpected keyword argument 'position_ids' is most likely caused by the wrong transformers version.
You only get garbage outputs if you use groupsize and act-order together, which is mentioned in the readme of my GPTQ fork. I won't update it yet, as I had very bad performance using upstream GPTQ the last time I tested it. We are at a state here that works well and is fast, while qwopqwop200 is focused on better perplexity results, which aren't that important to KoboldAI users, even at the cost of performance or compatibility.
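That is, it's specifically a flag combination like this one (reusing the hypothetical paths from above) that the readme warns against:
python repos/gptq/llama.py models/test c4 --wbits 4 --groupsize 128 --act-order --save_safetensors models/test/4bit-128g.safetensors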
I quantized using the gptq that comes as a submodule of 0cc4m/KoboldAI. Isn't KoboldAI using that to run inference? I can run repos/gptq/llama_inference.py without the error. I'm also installing transformers==4.28.0, which was specified in requirements.txt. What version should I install instead? Thanks for your help!
4.28.0 is correct, and so is using the GPTQ version in repos/gptq. Do you still get that TypeError, or something else now?
Unfortunately it's the same TypeError: LlamaDecoderLayer.forward() got an unexpected keyword argument 'position_ids'
You must have an outdated transformers version, as LlamaDecoderLayer.forward() does indeed have that parameter.
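If you want to double-check, this one-liner prints the version that is actually active in your environment:
python -c "import transformers; print(transformers.__version__)"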
Yep, my bad! That was it! After I pulled today, I didn't run pip install -r requirements.txt! By the way, is there a flag I need to pass to aiserver.py in order to enable the API for TavernAI to connect? I can open the link from the browser, but I can't seem to use the same link to connect from TavernAI. Also, is there a way to load a 4bit model on Colab without using the UI? I played with --model and --path, but no luck. My model is at models/test/4bit.safetensors. Thanks so much for your help!
Actually, I got the API to work. I just needed to use the first link, not the second one with /new_ui, and then add /api. Now I just need to figure out how to automatically load the 4bit model when aiserver.py starts, without using the UI. I'd appreciate your help!
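For anyone else hitting this: if Colab prints a link like https://example.trycloudflare.com (a hypothetical hostname), the URL to paste into TavernAI would be https://example.trycloudflare.com/api, not the /new_ui variant.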
I don't think that works in the latestgptq branch yet, but it might work in the model-structure-update branch if you run it with --model modelname, where modelname is the name of the folder in models/.
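For example, with the model from earlier in this thread sitting in models/test, that would presumably be:
python aiserver.py --model test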
Thanks for the info.
Should work in the latestgptq branch now, too.
Thanks, but now it gives me the error No module named 'hf_bleeding_edge'. !pip install hf_bleeding_edge doesn't work either. What module is that, and where can I get it? It's not in requirements.txt?
The reliable way to check which packages you need is the environments/huggingface.yml file, not requirements.txt. In this case, you need https://github.com/0cc4m/hf_bleeding_edge, which can be installed with pip.
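For example, pip can install straight from the repository URL (assuming the repo ships a standard setup script):
pip install git+https://github.com/0cc4m/hf_bleeding_edge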
I cloned the latestgptq branch with the --recurse-submodules flag:
git clone https://github.com/0cc4m/KoboldAI -b latestgptq --recurse-submodules
I quantized a model using the gptq inside repos:
python llama.py models/test c4 --wbits 4 --true-sequential --act-order --save_safetensors models/test/4bit.safetensors
I can manually run the inference:
python repos/gptq/llama_inference.py models/test --wbits 4 --load models/test/4bit.safetensors --text "Once upon a time, "
It also looks like it loads fine with KoboldAI. However, I get an error when I try to submit text from KoboldAI and generate.