sauravm8 opened this issue 1 year ago (status: Open)
I have the same issue with Llama-2-70B-chat-GPTQ model. Have you solved it? Model TheBloke/guanaco-65B-GPTQ works on multiple GPUs without errors.
Yes, use the latest template provided by TheBloke, with device='auto'.
Where is that template and how do I use it?
I tried setting device='auto' in run_localGPT.py and got an error.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="auto",
    use_triton=False,
    quantize_config=None,
)
python run_localGPT_API.py
load INSTRUCTOR_Transformer
max_seq_length 512
WARNING:auto_gptq.nn_modules.qlinear_old:CUDA extension not installed.
Traceback (most recent call last):
File "//run_localGPT_API.py", line 67, in <module>
LLM = load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)
File "/run_localGPT.py", line 79, in load_model
model = AutoGPTQForCausalLM.from_quantized(
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 82, in from_quantized
return quant_func(
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 753, in from_quantized
device = torch.device(device)
RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: auto
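For context, the failure can be reproduced with torch alone: torch.device only accepts concrete device strings such as "cpu" or "cuda:0", so "auto" is rejected the moment from_quantized calls torch.device(device). A minimal illustration (not code from the thread):

import torch

torch.device("cuda:0")   # accepted: a concrete device string
torch.device("auto")     # raises the RuntimeError quoted above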
Please share your CUDA, PyTorch, Transformers, and AutoGPTQ versions.
CUDA 11.7, torch 2.0.1, transformers 4.33.1, auto-gptq 0.2.2
Update AutoGPTQ to 0.4.2 or later, then check TheBloke's page for any GPTQ model and see how he loads them.
Using that newer loading code, load the model and pass the resulting object to the HF pipeline, as in the sketch below.
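A minimal sketch of that pattern, based on the loading code TheBloke's model cards show for newer AutoGPTQ releases; the generation parameters and the omission of model_basename are assumptions, not code from this thread:

from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    use_safetensors=True,
    device_map="auto",       # let accelerate spread layers across the available GPUs
    use_triton=False,
    quantize_config=None,
    # model_basename may still be needed for older repos whose weight file
    # is not named model.safetensors
)

# Hand the loaded model to a standard transformers pipeline, which localGPT
# can then wrap in its LangChain HuggingFacePipeline.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
)
print(pipe("What is quantization?")[0]["generated_text"])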
This might also have been fixed by the latest commit; check that as well.
With AutoGPTQ 0.4.2 I get this error:
python run_localGPT.py
2023-09-13 12:25:57,631 - INFO - run_localGPT.py:180 - Running on: cuda
2023-09-13 12:25:57,631 - INFO - run_localGPT.py:181 - Display Source Documents set to: False
2023-09-13 12:25:57,837 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-09-13 12:26:01,308 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-13 12:26:01,380 - INFO - run_localGPT.py:45 - Loading Model: TheBloke/Llama-2-70B-chat-GPTQ, on: cuda
2023-09-13 12:26:01,381 - INFO - run_localGPT.py:46 - This action can take a few minutes!
2023-09-13 12:26:01,381 - INFO - run_localGPT.py:68 - Using AutoGPTQForCausalLM for quantized models
2023-09-13 12:26:01,699 - INFO - run_localGPT.py:75 - Tokenizer loaded
2023-09-13 12:26:02,440 - INFO - _base.py:827 - lm_head not been quantized, will be ignored when make_quant.
Traceback (most recent call last):
File "//run_localGPT.py", line 246, in <module>
main()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "//run_localGPT.py", line 209, in main
llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME)
File "//run_localGPT.py", line 77, in load_model
model = AutoGPTQForCausalLM.from_quantized(
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
return quant_func(
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 902, in from_quantized
cls.fused_attn_module_type.inject_to_model(
File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 163, in inject_to_model
raise ValueError("Exllama kernel does not support query/key/value fusion with act-order. Please either use inject_fused_attention=False or disable_exllama=True.")
ValueError: Exllama kernel does not support query/key/value fusion with act-order. Please either use inject_fused_attention=False or disable_exllama=True.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    # device="auto",
    use_triton=False,
    quantize_config=None,
)
Try loading a GPTQ model in isolation using TheBloke's template and see if it gives the same error; a minimal smoke test is sketched below.
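One possible standalone test, assuming AutoGPTQ 0.4.x; the inject_fused_attention=False flag follows the suggestion in the ValueError above, and the rest of the arguments are assumptions rather than code from the thread:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-70B-chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    use_safetensors=True,
    device_map="auto",
    inject_fused_attention=False,  # workaround named in the act-order ValueError
    use_triton=False,
    quantize_config=None,
)

# Inputs go to the first GPU, where the embedding layer normally lands under device_map="auto".
inputs = tokenizer("Hello", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))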
Remove device='auto' and use device_map='auto' instead.
I fixed it by changing this in load_models.py:
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device_map="auto",
    use_triton=False,
    quantize_config=None,
)
to
from transformers import AutoModelForCausalLM  # transformers' loader, not auto_gptq's

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=False,
    revision="main",
)
I am loading the quantized TheBloke/Llama-2-70B-chat-GPTQ or TheBloke/Llama-2-70B-GPTQ model across multiple GPUs. The model loads, but the query throws an error.
Full Stacktrace:
GPU config: 2x A10. When loading a smaller model that fits on 1 GPU, it can answer questions.
How do I utilize more than 1 GPU so I can use larger models?
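For reference, one way to steer how a sharded model is split is the max_memory argument that accelerate honours when device_map="auto" is used; the per-device limits below are placeholder assumptions for two 24 GB A10s, not values from this thread:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-70B-chat-GPTQ",
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "64GiB"},  # assumed per-device caps
    revision="main",
)
print(model.hf_device_map)  # shows which layers landed on which GPU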