ParisNeo / lollms-webui

Lord of Large Language Models Web User Interface
https://parisneo.github.io/lollms-webui/
Apache License 2.0

Exllama does not work with cpu only #467

Open johanno opened 6 months ago

johanno commented 6 months ago

Expected Behavior

Use the CPU with the hugging_face binding.

Current Behavior

Using device map: cpu
Couldn't load model.
Couldn't load model. Please verify your configuration file at /mnt/games_fast/lollms_data/configs or use the next menu to select a valid model
Binding returned this exception : Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object
Traceback (most recent call last):
  File "/mnt/games_fast/lollms-webui/lollms-webui/lollms_core/lollms/app.py", line 257, in load_model
    model = ModelBuilder(self.binding).get_model()
  File "/mnt/games_fast/lollms-webui/lollms-webui/lollms_core/lollms/binding.py", line 597, in __init__
    self.build_model()
  File "/mnt/games_fast/lollms-webui/lollms-webui/lollms_core/lollms/binding.py", line 600, in build_model
    self.model = self.binding.build_model()
  File "/mnt/games_fast/lollms-webui/lollms-webui/zoos/bindings_zoo/hugging_face/__init__.py", line 209, in build_model
    self.model = AutoModelForCausalLM.from_pretrained(str(model_path),
  File "/mnt/games_fast/lollms-webui/installer_files/lollms_env/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/mnt/games_fast/lollms-webui/installer_files/lollms_env/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3784, in from_pretrained
    model = quantizer.post_init_model(model)
  File "/mnt/games_fast/lollms-webui/installer_files/lollms_env/lib/python3.10/site-packages/optimum/gptq/quantizer.py", line 583, in post_init_model
    raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting `disable_exllama=True` in the quantization config object

personal_models_path: /mnt/games_fast/lollms_data/models
Binding name:hugging_face
Model name:WizardCoder-Python-7B-V1.0-GPTQ
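
The failing call in the traceback is the binding's `AutoModelForCausalLM.from_pretrained`. For reference, the same error should be reproducible outside lollms with a direct call like the sketch below (the model path is a placeholder; the GPTQ quantization config stored with the checkpoint defaults to the exllama backend, which is what trips the check in optimum):

```python
# Standalone reproduction sketch mirroring the from_pretrained call in the
# traceback above. The local model path is hypothetical.
from transformers import AutoModelForCausalLM

model_path = "/path/to/WizardCoder-Python-7B-V1.0-GPTQ"  # placeholder

# The checkpoint's config carries a GPTQ quantization_config with the exllama
# backend enabled by default. With device_map="cpu" every module lands on the
# CPU, so optimum's post_init_model() raises:
#   "Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires
#    all the modules to be on GPU."
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cpu",
)
```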

Steps to Reproduce

1. Select the hugging_face binding.
2. Select the WizardCoder-Python-7B-V1.0-GPTQ model.
3. Select "cpu" in the hugging_face binding settings.

Possible Solution

no idea

Context

I can't use the GPU, since 8 GB of VRAM isn't enough for most good models.


ellevaisellemoe commented 5 months ago

I am pretty sure exllama only works with GPU models: "A standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs".

ParisNeo commented 5 months ago

Hi. Yes, I'm sorry about that, but exllama is a GPU-only binding.
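
For completeness, the `disable_exllama=True` hint in the traceback refers to the quantization config that transformers builds for GPTQ checkpoints (`GPTQConfig`). A minimal, hedged sketch of passing it explicitly is shown below; note that this only switches off the exllama kernels and does not by itself guarantee that GPTQ inference works on a CPU-only machine, and wiring it into the hugging_face binding's settings would be a separate change.

```python
# Hedged sketch: what the error message's disable_exllama hint maps to in
# transformers. Disabling exllama does not by itself make GPTQ run on CPU.
from transformers import AutoModelForCausalLM, GPTQConfig

model_path = "/path/to/WizardCoder-Python-7B-V1.0-GPTQ"  # placeholder

quant_config = GPTQConfig(bits=4, disable_exllama=True)  # bits must match the checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cpu",
    quantization_config=quant_config,  # overrides the exllama setting from the model's config
)
```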