getumbrel / llama-gpt

A self-hosted, offline, ChatGPT-like chatbot. Powered by Llama 2. 100% private, with no data leaving your device. New: Code Llama support!
https://apps.umbrel.com/app/llama-gpt
MIT License

Trap invalid opcode #110

Closed Kekec852 closed 1 year ago

Kekec852 commented 1 year ago

Hello, I'm trying to get the models working with the GPU, without success (I have managed to get 7B and 13B working on the CPU). The error I'm getting is very strange (running: sudo ./run.sh --model 13b --with-cuda):

traps: python3[91882] trap invalid opcode ip:7ff310d6d72d sp:7fff5bdc7390 error:0 in libllama.so[7ff310d50000+76000]

That is from the kernel log; from the container itself I get no usable output beyond the following:

==========
== CUDA ==
==========

CUDA Version 12.1.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

/models/llama-2-13b-chat.bin model found.
make: *** No rule to make target 'build'.  Stop.
Initializing server with:
Batch size: 2096
Number of CPU threads: 90
Number of GPU layers: 10
Context window: 4096

CPU i'm using:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         44 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  90
  On-line CPU(s) list:   0-89
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU E7-4850 v2 @ 2.30GHz
    CPU family:          6
    Model:               62
    Thread(s) per core:  2
    Core(s) per socket:  11
    Socket(s):           4
    Stepping:            7
    BogoMIPS:            4588.94
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 popcnt aes xsave avx f16c rdrand hypervisor lahf_lm pti ssbd ibrs ibpb stibp fsgsbase smep erms xsaveopt md_clear flush_l1d arch_capabilities
Virtualization features: 
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   1.4 MiB (46 instances)
  L1i:                   1.4 MiB (46 instances)
  L2:                    11.5 MiB (46 instances)
  L3:                    96 MiB (4 instances)
NUMA:                    
  NUMA node(s):          4
  NUMA node0 CPU(s):     0-22
  NUMA node1 CPU(s):     23-45
  NUMA node2 CPU(s):     46-67
  NUMA node3 CPU(s):     68-89
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Mitigation; Clear CPU buffers; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Unknown: No mitigations
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

GPU:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti     On  | 0000BB99:00:00.0 Off |                  N/A |
| 22%   35C    P8              10W / 250W |      1MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Everything is running in a VM on a freshly installed Ubuntu 22.04.

I'm thinking that some component, like llama_cpp.server, is compiled in a way that is incompatible with some part of my setup, as the invalid opcode suggests.
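
A quick way to narrow this down is to check which SIMD extensions the CPU actually advertises: the lscpu flags above list avx but neither avx2 nor fma, which would explain a SIGILL if libllama.so was built with those enabled. A minimal check, assuming a standard Linux /proc/cpuinfo:

# Empty output here means the CPU lacks AVX2/FMA, so any libllama.so built with
# them enabled will crash with SIGILL ("trap invalid opcode") on first use.
grep -o -w -e avx2 -e fma /proc/cpuinfo | sort -u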

I will update this issue while doing more debugging.

Kekec852 commented 1 year ago

Hello, me. I have found the problem: the CPU is missing the AVX2 and FMA features, and therefore it will not run unless you modify cuda/ggml.Dockerfile as follows:

line 24: RUN CMAKE_ARGS="-DLLAMA_CUBLAS=on -DLLAMA_AVX2=off -DLLAMA_FMA=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78
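
After editing the Dockerfile, the image needs to be rebuilt so the new CMAKE_ARGS actually take effect. A rough sketch, assuming run.sh drives docker compose as in the stock setup (adjust file and service names to your install):

# Force a clean rebuild so the edited pip install line is re-run instead of being
# served from Docker's build cache (pass -f <compose-file> if the CUDA stack uses
# a separate compose file), then start things again.
sudo docker compose build --no-cache
sudo ./run.sh --model 13b --with-cuda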

Then it will run. Maybe this will be helpful to somebody else with old hardware. I got about 1 token/s on CPU and 1.8 tokens/s on GPU.
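
To double-check which instruction sets the rebuilt wheel was actually compiled with (rather than waiting for a crash mid-inference), the system-info string from llama.cpp can be printed. This is a sketch assuming the Python bindings expose llama_print_system_info(), mirroring llama.h; run it inside the API container:

# Prints something like "AVX = 1 | AVX2 = 0 | FMA = 0 | ...". AVX2 and FMA should
# read 0 after rebuilding with -DLLAMA_AVX2=off -DLLAMA_FMA=off on this CPU.
python3 -c "import llama_cpp; print(llama_cpp.llama_print_system_info().decode())"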