PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device, and it is 100% private.
Apache License 2.0

Could not find model in TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ #510

Open dportabella opened 10 months ago

dportabella commented 10 months ago

I select this model in constants.py as follows:

MODEL_ID = "TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ"
MODEL_BASENAME = "Wizard-Vicuna-7B-Uncensored-GPTQ-4bit-128g.no-act.order.safetensors"

run_localGPT.py fails with FileNotFoundError: Could not find model in TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ. The model seems to exist: https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ. How can I solve this?

$ python run_localGPT.py
2023-09-22 04:45:54,152 - INFO - run_localGPT.py:221 - Running on: cuda
2023-09-22 04:45:54,152 - INFO - run_localGPT.py:222 - Display Source Documents set to: False
2023-09-22 04:45:54,152 - INFO - run_localGPT.py:223 - Use history set to: False
2023-09-22 04:45:54,333 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-09-22 04:45:55,868 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-22 04:45:55,900 - INFO - run_localGPT.py:56 - Loading Model: TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ, on: cuda
2023-09-22 04:45:55,900 - INFO - run_localGPT.py:57 - This action can take a few minutes!
2023-09-22 04:45:55,900 - INFO - load_models.py:86 - Using AutoGPTQForCausalLM for quantized models
2023-09-22 04:45:56,105 - INFO - load_models.py:93 - Tokenizer loaded
Traceback (most recent call last):
  File "/home/david/localGPT/run_localGPT.py", line 258, in <module>
    main()
  File "/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/localGPT/run_localGPT.py", line 229, in main
    qa = retrieval_qa_pipline(device_type, use_history, promptTemplate_type="llama")
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/localGPT/run_localGPT.py", line 132, in retrieval_qa_pipline
    llm = load_model(device_type, model_id=MODEL_ID, model_basename=MODEL_BASENAME, LOGGING=logging)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/localGPT/run_localGPT.py", line 66, in load_model
    model, tokenizer = load_quantized_model_qptq(model_id, model_basename, device_type, LOGGING)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/localGPT/load_models.py", line 95, in load_quantized_model_qptq
    model = AutoGPTQForCausalLM.from_quantized(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/auto_gptq/modeling/auto.py", line 82, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/auto_gptq/modeling/_base.py", line 698, in from_quantized
    raise FileNotFoundError(f"Could not find model in {model_name_or_path}")
FileNotFoundError: Could not find model in TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ
rcantada commented 10 months ago

Please try MODEL_BASENAME = "model.safetensors"
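
For what it's worth, my understanding (not verified against this exact snapshot) is that AutoGPTQ downloads the repo snapshot and then looks for a weights file matching model_basename; if no file with that name exists in the repo, it raises FileNotFoundError even though the repo itself is reachable, and the weights file in that repo is literally named model.safetensors. A minimal sketch to check the actual filenames, assuming huggingface_hub is installed in the environment:

from huggingface_hub import list_repo_files

# List the files in the HF repo to see what the quantized weights file is
# actually called; MODEL_BASENAME should match that filename.
for f in list_repo_files("TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ"):
    if f.endswith(".safetensors"):
        print(f)  # expected to print something like "model.safetensors"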

erswelljustin commented 10 months ago

@dportabella - Did you resolve this? If so, could you share how, please?

huycke commented 10 months ago

MODEL_ID = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ" MODEL_BASENAME = "model.safetensors"

Changing MODEL_BASENAME to "model.safetensors" (it was originally a much longer filename) worked for me.

dportabella commented 10 months ago

Using the latest snapshot, with:

MODEL_ID = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
MODEL_BASENAME = "model.safetensors"

python run_localGPT.py complains with

/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.

Where do I set do_sample=True? Anyway, I unset temperature in run_localGPT.py, and it complains with


2023-09-24 22:04:36,080 - WARNING - qlinear_old.py:16 - CUDA extension not installed.

It seems the NVIDIA driver and CUDA are working correctly:

nvidia-smi  # test
Sun Sep 24 22:03:42 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4070        Off | 00000000:00:10.0 Off |                  N/A |
|  0%   44C    P8              11W / 200W |     10MiB / 12282MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      3372      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
(localGPT) david@ubuntu2:~/localGPT$ nvcc --version  # test
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

Here is the console output:

$ time python ingest.py           # learn all docs from SOURCE_DOCUMENTS/
2023-09-24 21:50:45,253 - INFO - ingest.py:121 - Loading documents from /home/david/localGPT/SOURCE_DOCUMENTS
2023-09-24 21:50:45,259 - INFO - ingest.py:34 - Loading document batch
2023-09-24 21:50:46,258 - INFO - ingest.py:130 - Loaded 1 documents from /home/david/localGPT/SOURCE_DOCUMENTS
2023-09-24 21:50:46,259 - INFO - ingest.py:131 - Split into 195 chunks of text
2023-09-24 21:50:46,632 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
2023-09-24 21:50:46,664 - INFO - instantiator.py:21 - Created a temporary directory at /tmp/tmpb3ulsgqr
2023-09-24 21:50:46,664 - INFO - instantiator.py:76 - Writing /tmp/tmpb3ulsgqr/_remote_module_non_scriptable.py
max_seq_length  512

real    0m9,623s
user    0m11,006s
sys     0m3,298s

$ python run_localGPT.py   
2023-09-24 22:04:33,261 - INFO - run_localGPT.py:221 - Running on: cuda
2023-09-24 22:04:33,261 - INFO - run_localGPT.py:222 - Display Source Documents set to: False
2023-09-24 22:04:33,261 - INFO - run_localGPT.py:223 - Use history set to: False
2023-09-24 22:04:33,432 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length  512
2023-09-24 22:04:35,043 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-24 22:04:35,079 - INFO - run_localGPT.py:56 - Loading Model: TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ, on: cuda
2023-09-24 22:04:35,079 - INFO - run_localGPT.py:57 - This action can take a few minutes!
2023-09-24 22:04:35,079 - INFO - load_models.py:86 - Using AutoGPTQForCausalLM for quantized models
2023-09-24 22:04:35,563 - INFO - load_models.py:93 - Tokenizer loaded
2023-09-24 22:04:36,080 - INFO - _base.py:727 - lm_head not been quantized, will be ignored when make_quant.
2023-09-24 22:04:36,080 - WARNING - qlinear_old.py:16 - CUDA extension not installed.
2023-09-24 22:04:36,895 - INFO - modeling.py:795 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2023-09-24 22:04:38,689 - WARNING - fused_llama_mlp.py:306 - skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2023-09-24 22:04:38,814 - INFO - run_localGPT.py:89 - Local LLM Loaded

Enter a query: list me the articles of the US constitution

> Question:
list me the articles of the US constitution

> Answer:

Enter a query: 
rcantada commented 10 months ago

@dportabella

I inserted do_sample=True at line 82 of run_localGPT.py:

    # Create a pipeline for text generation
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_length=MAX_NEW_TOKENS,
        # do_sample=True makes temperature/top_p take effect (and silences the do_sample warning)
        do_sample=True,
        temperature=0.2,
        top_p=0.95,
        repetition_penalty=1.15,
        generation_config=generation_config,
    )

But honestly I have no idea what sampling does to its performance.
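
My rough understanding, so take it with a grain of salt: with do_sample=False the pipeline decodes greedily and temperature/top_p are ignored (which is what the warning is about), while do_sample=True switches to stochastic sampling so those parameters actually shape the output, at the cost of less deterministic answers. As a sketch:

from transformers import GenerationConfig

# Greedy decoding: temperature/top_p are ignored, hence the UserWarning above.
greedy = GenerationConfig(do_sample=False, max_new_tokens=64)

# Sampling: temperature/top_p now influence which tokens are picked,
# so answers can vary from run to run.
sampled = GenerationConfig(do_sample=True, temperature=0.2, top_p=0.95, max_new_tokens=64)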

rcantada commented 10 months ago

@dportabella

The WARNING - qlinear_old.py:16 - CUDA extension not installed appears to be an issue with auto-gptq. I solved it by uninstalling and reinstalling it from source:

pip uninstall -y auto-gptq
GITHUB_ACTIONS=true pip install auto-gptq==0.2.2 --no-cache-dir

but I think it assumes that all build requirements for auto-gptq are met, including CUDA, cuDNN, PyTorch, etc.
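
Before rebuilding, a quick sanity check (a sketch, not something I ran on your setup) that the conda environment actually has a CUDA-enabled PyTorch, since nvidia-smi and nvcc only report the system driver and toolkit, not what the Python environment was built against:

import torch

# A CPU-only PyTorch build reports None for torch.version.cuda and False
# for is_available(); the auto-gptq CUDA extension cannot build against it.
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())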

dportabella commented 9 months ago

Reinstalling auto-gptq as suggested fails on my Ubuntu 22.04.3 LTS. Any idea?

$ git clone https://github.com/PromtEngineer/localGPT.git
$ cd localGPT
$ conda create -y --name localGPT python=3.11.3
$ conda activate localGPT
$ conda install python=3.11.3
$ pip install -r requirements.txt

$ pip uninstall -y auto-gptq

$ GITHUB_ACTIONS=true pip install auto-gptq==0.2.2 --no-cache-dir
Collecting auto-gptq==0.2.2
  Building wheel for auto-gptq (setup.py) ... error
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for auto-gptq
  Running setup.py clean for auto-gptq
Failed to build auto-gptq

my system:

$ inxi -Fxz
Mobo: N/A model: N/A serial: N/A BIOS: SeaBIOS
    v: rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org date: 04/01/2014
CPU:
  Info: 20-core model: AMD Ryzen 9 7900 bits: 64 type: MCP arch: Zen 3 rev: 2 cache: L1: 2.5 MiB
    L2: 10 MiB L3: 16 MiB
  Speed (MHz): avg: 3693 min/max: N/A cores: 1: 3693 2: 3693 3: 3693 4: 3693 5: 3693 6: 3693
    7: 3693 8: 3693 9: 3693 10: 3693 11: 3693 12: 3693 13: 3693 14: 3693 15: 3693 16: 3693 17: 3693
    18: 3693 19: 3693 20: 3693 bogomips: 147721
  Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
  Device-1: vendor: Red Hat driver: bochs-drm v: N/A bus-ID: 00:02.0
  Device-2: NVIDIA vendor: Gigabyte driver: nvidia v: 535.104.05 bus-ID: 00:10.0
  Display: server: X.org v: 1.21.1.4 with: Xwayland v: 22.1.1 driver: X:
    loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa gpu: bochs-drm tty: 193x54
    resolution: 1280x800
  Message: GL data unavailable in console. Try -G --display
Audio:
  Device-1: NVIDIA vendor: Gigabyte driver: snd_hda_intel v: kernel bus-ID: 00:10.1
  Sound Server-1: ALSA v: k6.2.0-33-generic running: yes
  Sound Server-2: PulseAudio v: 15.99.1 running: yes
  Sound Server-3: PipeWire v: 0.3.48 running: yes
Network:
  Device-1: Intel 82371AB/EB/MB PIIX4 ACPI vendor: Red Hat Qemu virtual machine
    type: network bridge driver: piix4_smbus v: N/A port: N/A bus-ID: 00:01.3
  Device-2: Red Hat Virtio network driver: virtio-pci v: 1 port: f0e0 bus-ID: 00:12.0
  IF: ens18 state: up speed: -1 duplex: unknown mac: <filter>
Drives:
  Local Storage: total: 1000 GiB used: 124.47 GiB (12.4%)
  ID-1: /dev/sda vendor: QEMU model: HARDDISK size: 1000 GiB
Partition:
  ID-1: / size: 982.73 GiB used: 124.47 GiB (12.7%) fs: ext4 dev: /dev/sda3
  ID-2: /boot/efi size: 512 MiB used: 6.1 MiB (1.2%) fs: vfat dev: /dev/sda2
Swap:
  ID-1: swap-1 type: file size: 2 GiB used: 88.3 MiB (4.3%) file: /swapfile
Sensors:
  System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 39 C
  Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
  Processes: 339 Uptime: 19h 59m Memory: 19.06 GiB used: 1.22 GiB (6.4%) Init: systemd
  runlevel: 5 Compilers: gcc: 11.4.0 Packages: 1791 Shell: Bash v: 5.1.16 inxi: 3.3.13
rcantada commented 9 months ago

@dportabella

It is difficult to say why auto-gptq is failing, because it only says "error". It did not even complain about PyTorch or CUDA.

My step-by-step installation on Ubuntu 22.04 is at https://github.com/PromtEngineer/localGPT/discussions/521#discussion-5663719. That is built around cuda-toolkit 11.7; if you are on cuda-toolkit 11.8, you would need to replace the PyTorch install with:

conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

AFAIK this install of PyTorch comes with its own CUDA, so check which cuDNN it is paired with using conda list.
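
If it helps, the pairing can also be read from Python (a sketch; the exact versions shown will depend on your install):

import torch

# CUDA and cuDNN versions the installed PyTorch build ships with,
# e.g. "11.8" and 8700 (cuDNN 8.7).
print(torch.version.cuda)
print(torch.backends.cudnn.version())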

If it is of any use, the differences between our Ubuntu setups (other than your hardware being better) are that I'm on KDE and Xorg while you seem to be on Wayland, and my NVIDIA driver is 535.86.05 from the repo.

bookandlover commented 9 months ago

I already have a model locally, which I have placed under the "model" directory of localGPT. However, every time I run the program, it still attempts to download the model. How can I resolve this? The model I have chosen is MODEL_ID = "shaowenchen/chinese-alpaca-2-13b-16k-gguf" with MODEL_BASENAME = "chinese-alpaca-2-13b-16k.Q6_K.gguf".