Open dportabella opened 10 months ago
Please try MODEL_BASENAME = "model.safetensors"
@dportabella - Did you resolve this? If so could you share how please.
MODEL_ID = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
MODEL_BASENAME = "model.safetensors"
Changing MODEL_BASENAME to "model.safetensors" (it was originally a much longer file name) worked for me.
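For anyone unsure which value to use for MODEL_BASENAME, here is a minimal sketch that lists the weight files actually present in a Hugging Face repo. It assumes the huggingface_hub package is installed and that network access is available; the helper names candidate_basenames and list_remote_files are made up for illustration and are not part of localGPT.

```python
from typing import Iterable, List


def candidate_basenames(repo_files: Iterable[str]) -> List[str]:
    """Return the files that look like model weights, sorted by name."""
    weight_suffixes = (".safetensors", ".bin", ".gguf")
    return sorted(name for name in repo_files if name.endswith(weight_suffixes))


def list_remote_files(model_id: str) -> List[str]:
    """Fetch the file list of a Hugging Face repo (needs network access)."""
    # Imported lazily so the pure helper above works without the package.
    from huggingface_hub import list_repo_files

    return list_repo_files(model_id)


# Usage (not run here because it needs network access):
#   files = list_remote_files("TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ")
#   print(candidate_basenames(files))
```

Whatever file name this prints for the chosen repo is the one to compare against MODEL_BASENAME in constants.py.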
Using the latest snapshot, with:
MODEL_ID = "TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ"
MODEL_BASENAME = "model.safetensors"
running python run_localGPT.py complains with:
/home/david/anaconda3/envs/localGPT/lib/python3.11/site-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.2` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
Where do I set do_sample=True?
Anyway, I unset temperature in run_localGPT.py, and it then complains with:
2023-09-24 22:04:36,080 - WARNING - qlinear_old.py:16 - CUDA extension not installed.
It seems the NVIDIA driver and CUDA are working correctly:
nvidia-smi # test
Sun Sep 24 22:03:42 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 4070 Off | 00000000:00:10.0 Off | N/A |
| 0% 44C P8 11W / 200W | 10MiB / 12282MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3372 G /usr/lib/xorg/Xorg 4MiB |
+---------------------------------------------------------------------------------------+
(localGPT) david@ubuntu2:~/localGPT$ nvcc --version # test
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
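A note on reading the two outputs above: nvidia-smi reports the highest CUDA version the driver supports (12.2 here), while nvcc reports the version of the installed CUDA toolkit (11.5 here), so the two numbers need not match. Below is a minimal sketch that pulls both numbers out of the command output so they can be compared with the CUDA version PyTorch was built for (torch.version.cuda). The function names are hypothetical, and whether this difference is related to the warning is not established in this thread.

```python
import re
from typing import Optional, Tuple

Version = Tuple[int, int]


def driver_cuda_version(nvidia_smi_output: str) -> Optional[Version]:
    """Extract the 'CUDA Version: X.Y' field printed by nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_cuda_version(nvcc_output: str) -> Optional[Version]:
    """Extract the 'release X.Y' field printed by nvcc --version."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


# Lines copied from the output posted above.
smi_line = "| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |"
nvcc_line = "Cuda compilation tools, release 11.5, V11.5.119"

print("driver supports up to:", driver_cuda_version(smi_line))
print("installed toolkit:", toolkit_cuda_version(nvcc_line))
```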
Here is the console output:
$ time python ingest.py # learn all docs from SOURCE_DOCUMENTS/
2023-09-24 21:50:45,253 - INFO - ingest.py:121 - Loading documents from /home/david/localGPT/SOURCE_DOCUMENTS
2023-09-24 21:50:45,259 - INFO - ingest.py:34 - Loading document batch
2023-09-24 21:50:46,258 - INFO - ingest.py:130 - Loaded 1 documents from /home/david/localGPT/SOURCE_DOCUMENTS
2023-09-24 21:50:46,259 - INFO - ingest.py:131 - Split into 195 chunks of text
2023-09-24 21:50:46,632 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
2023-09-24 21:50:46,664 - INFO - instantiator.py:21 - Created a temporary directory at /tmp/tmpb3ulsgqr
2023-09-24 21:50:46,664 - INFO - instantiator.py:76 - Writing /tmp/tmpb3ulsgqr/_remote_module_non_scriptable.py
max_seq_length 512
real 0m9,623s
user 0m11,006s
sys 0m3,298s
$ python run_localGPT.py
2023-09-24 22:04:33,261 - INFO - run_localGPT.py:221 - Running on: cuda
2023-09-24 22:04:33,261 - INFO - run_localGPT.py:222 - Display Source Documents set to: False
2023-09-24 22:04:33,261 - INFO - run_localGPT.py:223 - Use history set to: False
2023-09-24 22:04:33,432 - INFO - SentenceTransformer.py:66 - Load pretrained SentenceTransformer: hkunlp/instructor-large
load INSTRUCTOR_Transformer
max_seq_length 512
2023-09-24 22:04:35,043 - INFO - posthog.py:16 - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2023-09-24 22:04:35,079 - INFO - run_localGPT.py:56 - Loading Model: TheBloke/Wizard-Vicuna-13B-Uncensored-GPTQ, on: cuda
2023-09-24 22:04:35,079 - INFO - run_localGPT.py:57 - This action can take a few minutes!
2023-09-24 22:04:35,079 - INFO - load_models.py:86 - Using AutoGPTQForCausalLM for quantized models
2023-09-24 22:04:35,563 - INFO - load_models.py:93 - Tokenizer loaded
2023-09-24 22:04:36,080 - INFO - _base.py:727 - lm_head not been quantized, will be ignored when make_quant.
2023-09-24 22:04:36,080 - WARNING - qlinear_old.py:16 - CUDA extension not installed.
2023-09-24 22:04:36,895 - INFO - modeling.py:795 - We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2023-09-24 22:04:38,689 - WARNING - fused_llama_mlp.py:306 - skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
The model 'LlamaGPTQForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
2023-09-24 22:04:38,814 - INFO - run_localGPT.py:89 - Local LLM Loaded
Enter a query: list me the articles of the US constitution
> Question:
list me the articles of the US constitution
> Answer:
Enter a query:
@dportabella
I inserted do_sample=True in line 82 of run_localGPT.py:
# Create a pipeline for text generation
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=MAX_NEW_TOKENS,
    # do_sample warning
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.15,
    generation_config=generation_config,
)
But honestly I have no idea what sampling does to its performance.
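On what sampling does: with do_sample=False the next token is always the most likely one and temperature is ignored, which is what the warning is saying. With do_sample=True the next token is drawn at random from the model's probability distribution after the logits are divided by the temperature. The sketch below illustrates this on made-up logits for three tokens; it is a toy model of the idea, not the transformers implementation, and the function names are hypothetical.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def greedy_choice(logits: List[float]) -> int:
    """What do_sample=False amounts to: always pick the highest logit."""
    return max(range(len(logits)), key=lambda index: logits[index])


def sample_choice(logits: List[float], temperature: float, rng: random.Random) -> int:
    """What do_sample=True amounts to: draw a token index at random."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
print("T=1.0:", softmax_with_temperature(toy_logits, 1.0))
print("T=0.2:", softmax_with_temperature(toy_logits, 0.2))
print("greedy pick:", greedy_choice(toy_logits))
print("sampled pick:", sample_choice(toy_logits, 0.2, random.Random(0)))
```

At a low temperature such as 0.2 almost all of the probability sits on the top token, so sampled output should stay close to greedy output; higher temperatures spread the probability out and make answers more varied.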
@dportabella
The "WARNING - qlinear_old.py:16 - CUDA extension not installed" message appears to be an issue with auto-gptq. I solved it by uninstalling auto-gptq and reinstalling it from source:
pip uninstall -y auto-gptq
GITHUB_ACTIONS=true pip install auto-gptq==0.2.2 --no-cache-dir
but I think this assumes that all build requirements for auto-gptq are met, including CUDA, cuDNN, PyTorch, etc.
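To see whether the reinstall actually produced the compiled extension, a quick check is to ask Python which modules it can find. This is a minimal sketch; the extension module name autogptq_cuda is an assumption about how auto-gptq 0.2.x names its compiled module and may differ in other versions, and is_importable is a hypothetical helper.

```python
import importlib.util
from typing import Dict, Iterable


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def report(module_names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: is_importable(name) for name in module_names}


# "autogptq_cuda" is an assumed name for the compiled CUDA extension.
for name, found in report(["torch", "auto_gptq", "autogptq_cuda"]).items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the Python package is found but the compiled module is missing, the wheel was probably installed without its CUDA extension.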
Reinstalling auto-gptq as mentioned fails on my Ubuntu 22.04.3 LTS. Any idea?
$ git clone https://github.com/PromtEngineer/localGPT.git
$ cd localGPT
$ conda create -y --name localGPT python=3.11.3
$ conda activate localGPT
$ conda install python=3.11.3
$ pip install -r requirements.txt
$ pip uninstall -y auto-gptq
$ GITHUB_ACTIONS=true pip install auto-gptq==0.2.2 --no-cache-dir
Collecting auto-gptq==0.2.2
Building wheel for auto-gptq (setup.py) ... error
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for auto-gptq
Running setup.py clean for auto-gptq
Failed to build auto-gptq
My system:
$ inxi -Fxz
Mobo: N/A model: N/A serial: N/A BIOS: SeaBIOS
v: rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org date: 04/01/2014
CPU:
Info: 20-core model: AMD Ryzen 9 7900 bits: 64 type: MCP arch: Zen 3 rev: 2 cache: L1: 2.5 MiB
L2: 10 MiB L3: 16 MiB
Speed (MHz): avg: 3693 min/max: N/A cores: 1: 3693 2: 3693 3: 3693 4: 3693 5: 3693 6: 3693
7: 3693 8: 3693 9: 3693 10: 3693 11: 3693 12: 3693 13: 3693 14: 3693 15: 3693 16: 3693 17: 3693
18: 3693 19: 3693 20: 3693 bogomips: 147721
Flags: avx avx2 ht lm nx pae sse sse2 sse3 sse4_1 sse4_2 sse4a ssse3 svm
Graphics:
Device-1: vendor: Red Hat driver: bochs-drm v: N/A bus-ID: 00:02.0
Device-2: NVIDIA vendor: Gigabyte driver: nvidia v: 535.104.05 bus-ID: 00:10.0
Display: server: X.org v: 1.21.1.4 with: Xwayland v: 22.1.1 driver: X:
loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa gpu: bochs-drm tty: 193x54
resolution: 1280x800
Message: GL data unavailable in console. Try -G --display
Audio:
Device-1: NVIDIA vendor: Gigabyte driver: snd_hda_intel v: kernel bus-ID: 00:10.1
Sound Server-1: ALSA v: k6.2.0-33-generic running: yes
Sound Server-2: PulseAudio v: 15.99.1 running: yes
Sound Server-3: PipeWire v: 0.3.48 running: yes
Network:
Device-1: Intel 82371AB/EB/MB PIIX4 ACPI vendor: Red Hat Qemu virtual machine
type: network bridge driver: piix4_smbus v: N/A port: N/A bus-ID: 00:01.3
Device-2: Red Hat Virtio network driver: virtio-pci v: 1 port: f0e0 bus-ID: 00:12.0
IF: ens18 state: up speed: -1 duplex: unknown mac: <filter>
Drives:
Local Storage: total: 1000 GiB used: 124.47 GiB (12.4%)
ID-1: /dev/sda vendor: QEMU model: HARDDISK size: 1000 GiB
Partition:
ID-1: / size: 982.73 GiB used: 124.47 GiB (12.7%) fs: ext4 dev: /dev/sda3
ID-2: /boot/efi size: 512 MiB used: 6.1 MiB (1.2%) fs: vfat dev: /dev/sda2
Swap:
ID-1: swap-1 type: file size: 2 GiB used: 88.3 MiB (4.3%) file: /swapfile
Sensors:
System Temperatures: cpu: N/A mobo: N/A gpu: nvidia temp: 39 C
Fan Speeds (RPM): N/A gpu: nvidia fan: 0%
Info:
Processes: 339 Uptime: 19h 59m Memory: 19.06 GiB used: 1.22 GiB (6.4%) Init: systemd
runlevel: 5 Compilers: gcc: 11.4.0 Packages: 1791 Shell: Bash v: 5.1.16 inxi: 3.3.13
@dportabella
It is difficult to say why auto-gptq is failing because the output only says "error". It did not even complain about PyTorch or CUDA.
My step-by-step installation for Ubuntu 22.04 is at https://github.com/PromtEngineer/localGPT/discussions/521#discussion-5663719. That guide is built around cuda-toolkit 11.7; if you are on cuda-toolkit 11.8, you would need to replace the PyTorch install with:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
AFAIK this install of PyTorch comes with its own CUDA, so check which cuDNN it is paired with using conda list.
If it is of any use, the difference between our Ubuntu setups (other than your hardware being better) is that I'm on KDE and Xorg, and you seem to be on Wayland. My NVIDIA driver is 535.86.05 from the repo.
I already have a model locally, and I have placed it under the "model" directory of localGPT. However, every time I run the program, it still attempts to download the model. How can I resolve this issue? The model I have chosen is MODEL_ID = "shaowenchen/chinese-alpaca-2-13b-16k-gguf" and its MODEL_BASENAME is "chinese-alpaca-2-13b-16k.Q6_K.gguf".
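One possible cause, not verified against the localGPT code: if the models directory is handed to huggingface_hub as a cache directory, the library looks for files under its own cache layout (a models--<org>--<name> folder with snapshots inside), so a file dropped directly into the directory would not be picked up. The sketch below only helps inspect where a file currently sits; find_local_copies and hf_cache_folder_name are hypothetical helpers, and the directory name "models" in the usage comment is an assumption.

```python
from pathlib import Path
from typing import List


def hf_cache_folder_name(model_id: str) -> str:
    """Folder name the Hugging Face cache uses for a model repo."""
    return "models--" + model_id.replace("/", "--")


def find_local_copies(models_dir: str, basename: str) -> List[Path]:
    """Return every file named `basename` anywhere below `models_dir`."""
    root = Path(models_dir)
    if not root.is_dir():
        return []
    return sorted(path for path in root.rglob(basename) if path.is_file())


# Usage, assuming the models directory is ./models:
#   print(hf_cache_folder_name("shaowenchen/chinese-alpaca-2-13b-16k-gguf"))
#   print(find_local_copies("models", "chinese-alpaca-2-13b-16k.Q6_K.gguf"))
```

If the file shows up outside the expected cache folder, that would be consistent with the repeated download.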
I select this model in constants.py as follows:
run_localGPT.py fails with FileNotFoundError: Could not find model in TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ. The model seems to exist: https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GPTQ How can I solve this?