abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io

Illegal instruction (core dumped) when using llama_cpp_python-0.2.81 with text-generation-webui #1578

Closed. 9600- closed this issue 2 months ago.

9600- commented 3 months ago

Prerequisites

Please answer the following questions for yourself before submitting an issue.

Expected Behavior

The GGUF model should load, or an error should be reported.

Current Behavior

19:46:40-923010 INFO     Starting Text generation web UI
19:46:40-927388 INFO     Loading settings from "settings.yaml"
19:46:40-930253 INFO     Loading the extension "openai"
19:46:41-059825 INFO     OpenAI-compatible API URL:
http://0.0.0.0:5000
Running on local URL:  http://0.0.0.0:7860
19:49:19-606314 INFO     Loading "Meta-Llama-3-70B-Instruct_fixed.Q8_0.gguf"
19:49:20-073075 INFO     llama.cpp weights detected: "models/Meta-Llama-3-70B-Instruct_fixed.Q8_0.gguf"
Illegal instruction (core dumped)

Switching back to llama_cpp_python_cuda-0.2.79 resolves the issue.
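
A plausible cause, given the lscpu flags below (this Ivy Bridge Xeon reports avx but neither avx2 nor fma), is a prebuilt wheel compiled with AVX2/FMA instructions. A quick check and a rebuild sketch follow; the GGML_AVX2/GGML_FMA flag names are assumptions based on the GGML_-prefixed CMake options llama.cpp uses around this version (older builds used LLAMA_AVX2):

    # Check whether the CPU advertises AVX2/FMA; a wheel compiled with
    # AVX2 dies with SIGILL on its first vectorized call on a CPU without it.
    grep -o -m1 'avx2\|fma' /proc/cpuinfo || echo "no AVX2/FMA on this CPU"

    # Rebuild with the unsupported instruction sets disabled.
    CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_FMA=OFF" FORCE_CMAKE=1 \
      pip install llama-cpp-python --no-cache-dir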

I can build llama_cpp_python_cuda-0.2.81 successfully using CMAKE_ARGS="-DLLAVA_BUILD=OFF", but then receive the following error in TGW when loading a model.

The error message references #1575:

Exception: Cannot import 'llama_cpp_cuda' because 'llama_cpp' is already imported. See issue #1575 in llama-cpp-python. Please restart the server before attempting to use a different version of llama-cpp-python.
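
For context, the guard raising this checks process state rather than installed packages: once one binding variant is imported, the other cannot be loaded into the same Python process. A minimal sketch of the mechanism (a standalone illustration, not TGW's actual code):

    # Whichever variant is imported first occupies the process; TGW
    # detects this via sys.modules and refuses the other until a restart.
    python3 -c "import sys, llama_cpp; print('llama_cpp' in sys.modules)"
    # -> True; importing llama_cpp_cuda in the same process is what TGW blocks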

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  48
  On-line CPU(s) list:   0-47
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
    CPU family:          6
    Model:               62
    Thread(s) per core:  2
    Core(s) per socket:  12
    Socket(s):           2
    Stepping:            4
    CPU max MHz:         3500.0000
    CPU min MHz:         1200.0000
    BogoMIPS:            5399.65
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
Virtualization features: 
  Virtualization:        VT-x
Caches (sum of all):     
  L1d:                   768 KiB (24 instances)
  L1i:                   768 KiB (24 instances)
  L2:                    6 MiB (24 instances)
  L3:                    60 MiB (2 instances)
NUMA:                    
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-11,24-35
  NUMA node1 CPU(s):     12-23,36-47
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
$ lsb_release -a

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:    22.04
Codename:   jammy
$ python3 --version

Python 3.11.9

$ make --version

GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

$ g++ --version

g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ nvidia-smi --version

NVIDIA-SMI version  : 550.90.07
NVML version        : 550.90
DRIVER version      : 550.90.07
CUDA Version        : 12.4

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. git clone https://github.com/oobabooga/text-generation-webui.git
  2. cd text-generation-webui
  3. ./start_linux.sh
  4. Open the web UI and attempt to load a GGUF model.
  5. Immediate crash with Illegal instruction (core dumped); a standalone repro is sketched below.
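
To rule out text-generation-webui itself, the crash can be reproduced against the bindings alone (a sketch; the model path is the one from the log above, and any GGUF should behave the same):

    # Loading any GGUF through llama-cpp-python directly reproduces the
    # SIGILL if the wheel was built for an unsupported instruction set.
    python3 -c "from llama_cpp import Llama; Llama(model_path='models/Meta-Llama-3-70B-Instruct_fixed.Q8_0.gguf')"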

To reproduce the build error:

  1. conda activate text-generation-webui/installer_files/env
  2. pip uninstall -y llama_cpp_python llama_cpp_python_cuda
  3. export CMAKE_ARGS="-DGGML_CUDA=ON"
  4. export FORCE_CMAKE=1
  5. pip install llama-cpp-python --no-cache-dir (equivalent one-liner below)
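
The same build can be written as one command, with the variables scoped to the pip invocation (the pattern the llama-cpp-python README uses for CUDA builds):

    # Env vars apply only to this pip call, so no exported shell state
    # can leak into later builds.
    CMAKE_ARGS="-DGGML_CUDA=ON" FORCE_CMAKE=1 \
      pip install llama-cpp-python --no-cache-dir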

Failure Logs

When trying to load the model: identical to the log shown under Current Behavior above, ending in Illegal instruction (core dumped).

When trying to build:

      FAILED: vendor/llama.cpp/examples/llava/llama-llava-cli
      : && /usr/bin/g++  -pthread -B ~/text-generation-webui/installer_files/env/compiler_compat -O3 -DNDEBUG  vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/llava.cpp.o vendor/llama.cpp/examples/llava/CMakeFiles/llava.dir/clip.cpp.o vendor/llama.cpp/examples/llava/CMakeFiles/llama-llava-cli.dir/llava-cli.cpp.o -o vendor/llama.cpp/examples/llava/llama-llava-cli  -Wl,-rpath,/tmp/tmp3wy_8hxj/build/vendor/llama.cpp/src:/tmp/tmp3wy_8hxj/build/vendor/llama.cpp/ggml/src:  vendor/llama.cpp/common/libcommon.a  vendor/llama.cpp/src/libllama.so  vendor/llama.cpp/ggml/src/libggml.so && :
      ~/text-generation-webui/installer_files/env/compiler_compat/ld: warning: libgomp.so.1, needed by vendor/llama.cpp/ggml/src/libggml.so, not found (try using -rpath or -rpath-link)
      ~/text-generation-webui/installer_files/env/compiler_compat/ld: vendor/llama.cpp/ggml/src/libggml.so: undefined reference to `GOMP_barrier@GOMP_1.0'
      ~/text-generation-webui/installer_files/env/compiler_compat/ld: vendor/llama.cpp/ggml/src/libggml.so: undefined reference to `GOMP_parallel@GOMP_4.0'
      ~/text-generation-webui/installer_files/env/compiler_compat/ld: vendor/llama.cpp/ggml/src/libggml.so: undefined reference to `omp_get_thread_num@OMP_1.0'
      ~/text-generation-webui/installer_files/env/compiler_compat/ld: vendor/llama.cpp/ggml/src/libggml.so: undefined reference to `GOMP_single_start@GOMP_1.0'
      ~/text-generation-webui/installer_files/env/compiler_compat/ld: vendor/llama.cpp/ggml/src/libggml.so: undefined reference to `omp_get_num_threads@OMP_1.0'
      collect2: error: ld returned 1 exit status
      ninja: build stopped: subcommand failed.

      *** CMake build failed
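
The undefined GOMP_* references suggest the conda environment's compiler_compat linker cannot find the system libgomp.so.1 that libggml.so now depends on (llama.cpp builds with OpenMP by default around this version). Two hedged workarounds; the conda package name and the GGML_OPENMP switch are assumptions, not tested fixes:

    # Option 1: provide libgomp inside the conda env so compiler_compat's
    # ld can resolve the GOMP_* symbols.
    conda install -c conda-forge libgomp

    # Option 2: build without OpenMP, removing the libgomp dependency.
    CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_OPENMP=OFF" FORCE_CMAKE=1 \
      pip install llama-cpp-python --no-cache-dir
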
9600- commented 3 months ago

I am able to build llama_cpp_python_cuda-0.2.81 using CMAKE_ARGS="-DLLAVA_BUILD=OFF". However, model loads still fail in TGW with the following error:

Exception: Cannot import 'llama_cpp_cuda' because 'llama_cpp' is already imported. See issue #1575 in llama-cpp-python. Please restart the server before attempting to use a different version of llama-cpp-python.
congson1293 commented 3 months ago

I have the same issue with an RTX 4090, Ubuntu 20.04, and CUDA 12.1.

9600- commented 2 months ago

@congson1293 What type of CPU are you running?

congson1293 commented 2 months ago

@9600- The bug occurs on both Intel and AMD CPUs.