1b5d / llm-api

Run any Large Language Model behind a unified API

Illegal instruction (core dumped) #15

Open dbzoo opened 1 year ago

dbzoo commented 1 year ago

I presume there is a minimum CPU requirement, like needing AVX2, AVX-512, F16C or something? Could you document the minimum instruction set and extensions required?

root@1d1c4289f303:/llm-api# python app/main.py
2023-10-26 23:31:19,237 - INFO - llama - found an existing model /models/llama_601507219781/ggml-model-q4_0.bin
2023-10-26 23:31:19,237 - INFO - llama - setup done successfully for /models/llama_601507219781/ggml-model-q4_0.bin
Illegal instruction (core dumped)
root@1d1c4289f303:/llm-api#

--- modulename: llama, funcname: __init__
llama.py(289): self.verbose = verbose
llama.py(291): self.numa = numa
llama.py(292): if not Llama.__backend_initialized:
llama.py(293): if self.verbose:
llama.py(294): llama_cpp.llama_backend_init(self.numa)
--- modulename: llama_cpp, funcname: llama_backend_init
llama_cpp.py(475): return _lib.llama_backend_init(numa)
Illegal instruction (core dumped)

I assume this has CPU requirements. From the Dockerfile:

ENV CMAKE_ARGS "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
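Those flags are consumed when pip compiles llama-cpp-python from source; roughly something like the following runs at image build time (a sketch, not the project's exact Dockerfile):

$ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" FORCE_CMAKE=1 \
    pip install --no-cache-dir llama-cpp-python

Since the native code is compiled on the image build machine, it can end up using instructions (e.g. AVX) that the machine running the container does not have.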

OpenBLAS can be built for multiple targets with runtime detection of the target CPU by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line, or as -DDYNAMIC_ARCH=TRUE in cmake.

https://github.com/OpenMathLib/OpenBLAS/blob/develop/README.md
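For reference, a minimal sketch of building OpenBLAS with runtime CPU dispatch (assumes a plain source checkout; the install prefix is a placeholder, not something from this project's Dockerfile):

$ git clone https://github.com/OpenMathLib/OpenBLAS.git
$ cd OpenBLAS
$ make DYNAMIC_ARCH=1
$ make PREFIX=/opt/openblas install

With DYNAMIC_ARCH=1 the library ships kernels for several CPU generations and selects one at runtime, so a build done on an AVX-capable machine still runs on an older CPU.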

1b5d commented 1 year ago

Could you please share the configs you are using for this model?

1b5d commented 1 year ago

Btw I just built different images for different BLAS backends:

Could you please let me know if that helps?

dbzoo commented 1 year ago

I tried those images, and all still resulted in the illegal instruction. Thanks for the extra images to test with. If I can find some cycles, I will clone the repo, investigate, and submit a fix. I believe the issue is that OpenBLAS is not compiled with runtime CPU detection and assumes the instruction set of the build machine. With a modern CPU this isn't an issue, but I'm using an old CPU (my fault, it's what I have) with no AVX flag, which I think is the culprit.

$ cat /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm pti ssbd ibrs ibpb stibp tsc_adjust arat flush_l1d arch_capabilities
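A quick way to confirm which AVX variants, if any, the CPU reports (a convenience check, not part of the original output):

$ grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u

On this machine it prints nothing, matching the flags above: no avx, avx2, or avx512*, so any binary compiled with those instructions dies with an illegal instruction.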

For the record, here is the config:

models_dir: /models
model_family: llama
setup_params:
  repo_id: botato/point-alpaca-ggml-model-q4_0
  filename: ggml-model-q4_0.bin
model_params:
  n_ctx: 512
  n_parts: -1
  n_gpu_layers: 0
  seed: -1
  use_mmap: True
  n_threads: 8
  n_batch: 2048
  last_n_tokens_size: 64
  lora_base: null
  lora_path: null
  low_vram: False
  tensor_split: null
  rope_freq_base: 10000.0
  rope_freq_scale: 1.0
  verbose: True

What worked for me

Start up a container using the image:

docker run -it -v $PWD/models/:/models:rw -v $PWD/config/config.yaml:/llm-api/config.yaml:ro -p 8084:8000 --ulimit memlock=16000000000 1b5d/llm-api bash

At the shell prompt inside the container, rebuild llama-cpp-python from source:

$ FORCE_CMAKE=1 pip install --upgrade --no-deps --no-cache-dir --force-reinstall llama-cpp-python
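Rebuilding on the target machine is what made the difference here: the log further down shows AVX = 0 | AVX2 = 0, i.e. the library was compiled without the instructions this CPU lacks. A quick sanity check that the rebuilt package installed and its shared library loads (this only checks loading, not inference):

$ pip show llama-cpp-python | head -2
$ python -c "import llama_cpp; print('llama_cpp loaded')"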

Adjust the config to use a different model, since the latest llama-cpp-python expects GGUF rather than GGML models:

setup_params:
  repo_id: TheBloke/Llama-2-7B-GGUF
  filename: llama-2-7b.Q4_0.gguf

Then it starts without enabling BLAS - it's slow, but it works.

root@4784ec49bac1:/llm-api# python app/main.py
2023-11-15 02:27:54,593 - INFO - llama - found an existing model /models/llama_326323164690/llama-2-7b.Q4_0.gguf
2023-11-15 02:27:54,594 - INFO - llama - setup done successfully for /models/llama_326323164690/llama-2-7b.Q4_0.gguf
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/llama_326323164690/llama-2-7b.Q4_0.gguf (version GGUF V2)
....
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 72.06 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-15 02:17:35,436 - INFO - server - Started server process [219]
2023-11-15 02:17:35,436 - INFO - on - Waiting for application startup.
2023-11-15 02:27:55,563 - INFO - on - Application startup complete.
2023-11-15 02:27:55,564 - INFO - server - Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
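Once Uvicorn is up, a quick smoke test from the host looks something like this (the endpoint path and payload are my guess; check the llm-api README for the exact route, and note that container port 8000 is mapped to host port 8084 in the docker run above):

$ curl -s -X POST http://localhost:8084/generate \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "Hello"}'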

That might give you some clues.