dbzoo opened this issue 1 year ago
Could you please share the configs you are using for this model?
Btw I just built different images for different BLAS backends:
Could you please let me know if that helps?
I tried those images, and all of them still resulted in the illegal instruction. Thanks for the extra images to test with. If I can find some cycles, I will clone the repo, investigate, and submit a PR to correct it. I believe the issue is that OpenBLAS is not compiled with runtime CPU detection, so it assumes the build host's instruction set extensions are available at runtime. With a modern CPU this isn't an issue, but I'm using an old CPU (my fault, it's what I have) with no AVX flag, which I think is the culprit.
$ cat /proc/cpuinfo
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes hypervisor lahf_lm pti ssbd ibrs ibpb stibp tsc_adjust arat flush_l1d arch_capabilities
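For anyone else checking whether their CPU is affected: the relevant extensions show up as individual flags in that list, so a quick grep is enough to tell. On this box AVX, FMA and F16C are all absent and only SSE4.2 is present:
$ grep -Eo 'avx[a-z0-9_]*' /proc/cpuinfo | sort -u    # no output here, i.e. no AVX/AVX2/AVX-512
$ grep -Eo 'fma|f16c|sse4_2' /proc/cpuinfo | sort -u  # only sse4_2 shows up on this CPU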
For the record, here is the config:
models_dir: /models
model_family: llama
setup_params:
  repo_id: botato/point-alpaca-ggml-model-q4_0
  filename: ggml-model-q4_0.bin
model_params:
  n_ctx: 512
  n_parts: -1
  n_gpu_layers: 0
  seed: -1
  use_mmap: True
  n_threads: 8
  n_batch: 2048
  last_n_tokens_size: 64
  lora_base: null
  lora_path: null
  low_vram: False
  tensor_split: null
  rope_freq_base: 10000.0
  rope_freq_scale: 1.0
  verbose: True
If I start up a container using the image:
docker run -it -v $PWD/models/:/models:rw -v $PWD/config/config.yaml:/llm-api/config.yaml:ro -p 8084:8000 --ulimit memlock=16000000000 1b5d/llm-api bash
Then, at the shell prompt inside the container, rebuild llama-cpp-python from source:
$ FORCE_CMAKE=1 pip install --upgrade --no-deps --no-cache-dir --force-reinstall llama-cpp-python
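That rebuild lets CMake detect the container's actual CPU instead of reusing a prebuilt binary. If it still somehow picks up AVX, llama.cpp's CMake (at least in the versions around this llama-cpp-python release) exposes per-extension switches that can be forced off; treat the exact option names as something to double-check against the pinned llama.cpp revision:
$ CMAKE_ARGS="-DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF -DLLAMA_F16C=OFF" \
    FORCE_CMAKE=1 pip install --upgrade --no-deps --no-cache-dir --force-reinstall llama-cpp-python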
Adjust this in the config and use a different model, since the latest llama-cpp-python doesn't read GGML but GGUF:
setup_params:
  repo_id: TheBloke/Llama-2-7B-GGUF
  filename: llama-2-7b.Q4_0.gguf
Then it starts without enabling BLAS - it's slow, but it works.
root@4784ec49bac1:/llm-api# python app/main.py
2023-11-15 02:27:54,593 - INFO - llama - found an existing model /models/llama_326323164690/llama-2-7b.Q4_0.gguf
2023-11-15 02:27:54,594 - INFO - llama - setup done successfully for /models/llama_326323164690/llama-2-7b.Q4_0.gguf
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /models/llama_326323164690/llama-2-7b.Q4_0.gguf (version GGUF V2)
....
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 72.06 MB
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-15 02:17:35,436 - INFO - server - Started server process [219]
2023-11-15 02:17:35,436 - INFO - on - Waiting for application startup.
2023-11-15 02:27:55,563 - INFO - on - Application startup complete.
2023-11-15 02:27:55,564 - INFO - server - Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
That might give you some clues.
I presume there is a minimum CPU requirement like needing AVX2, AVX-512, F16C or something?
Could you document the minimum instruction set extensions required?
root@1d1c4289f303:/llm-api# python app/main.py
2023-10-26 23:31:19,237 - INFO - llama - found an existing model /models/llama_601507219781/ggml-model-q4_0.bin
2023-10-26 23:31:19,237 - INFO - llama - setup done successfully for /models/llama_601507219781/ggml-model-q4_0.bin
Illegal instruction (core dumped)
root@1d1c4289f303:/llm-api#
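Tracing it with the stdlib tracer (something along the lines of the command below, going by the output format) narrows the crash down to the llama_backend_init call:
$ python -m trace --trace app/main.py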
--- modulename: llama, funcname: __init__
llama.py(289): self.verbose = verbose
llama.py(291): self.numa = numa
llama.py(292): if not Llama.__backend_initialized:
llama.py(293): if self.verbose:
llama.py(294): llama_cpp.llama_backend_init(self.numa)
--- modulename: llama_cpp, funcname: llama_backend_init
llama_cpp.py(475): return _lib.llama_backend_init(numa)
Illegal instruction (core dumped)
I assume this has CPU requirements. The image's Dockerfile sets:
ENV CMAKE_ARGS "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
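A way to double-check whether the prebuilt shared objects in the image really contain AVX/FMA code (and so will SIGILL on a pre-AVX CPU) is to disassemble them and grep for an AVX-only mnemonic. The library names and path here are guesses, so locate them first:
$ find / -name 'libllama*.so*' -o -name 'libopenblas*.so*' 2>/dev/null
$ objdump -d <path-to-lib> | grep -cE 'vbroadcastss|vfmadd'   # a non-zero count means AVX/FMA instructions are present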
OpenBLAS can be built for multiple targets with runtime detection of the target CPU by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line, or as -DDYNAMIC_ARCH=TRUE in CMake.
https://github.com/OpenMathLib/OpenBLAS/blob/develop/README.md
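A minimal sketch of what that could look like when baking OpenBLAS into the image (the TARGET baseline and install prefix here are just assumptions; DYNAMIC_ARCH builds kernels for several CPU generations and dispatches at runtime, so the library grows but no longer assumes the build host's CPU):
$ git clone --depth 1 https://github.com/OpenMathLib/OpenBLAS.git
$ cd OpenBLAS
$ make -j"$(nproc)" DYNAMIC_ARCH=1 TARGET=NEHALEM   # NEHALEM = SSE4.2 baseline, matches pre-AVX CPUs like the one above
$ make PREFIX=/opt/openblas DYNAMIC_ARCH=1 install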