bentoml / OpenLLM

Run any open-source LLMs, such as Llama and Mistral, as OpenAI-compatible API endpoints in the cloud.
https://bentoml.com
Apache License 2.0

bug: `openllm run phi3:3.8b-ggml-q4` build fails to find FOUNDATION_LIBRARY #1064

Open sij1nk opened 3 months ago

sij1nk commented 3 months ago

Describe the bug

Hi!

I tried to run an LLM locally using OpenLLM. According to `openllm`, `phi3:3.8b-ggml-q4` happens to be the only model I am able to run locally, so I ran `openllm run phi3:3.8b-ggml-q4`, which failed (logs are attached).

The failure happens during a CMake build step, which is unable to find FOUNDATION_LIBRARY (line 116 in the logs). I looked into what this library provides and ended up in ggerganov/llama.cpp; Foundation is an Apple framework, which suggests the Metal backend is being built.

However, I assume GGML_METAL should not be set, as I'm running on a Linux system (well, Windows 10 + WSL2) with an NVIDIA GPU. My attempts at disabling the Metal build were not successful; the command failed with the same error as before.
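
For illustration, one such attempt might look like the following (a sketch only: the `CMAKE_ARGS` pass-through and the exact flag name are assumptions on my part, since newer llama.cpp builds use `GGML_METAL` while older ones used `LLAMA_METAL`):

```bash
# Hypothetical attempt to rebuild llama-cpp-python without the Metal backend;
# assumes the build honors CMAKE_ARGS and that GGML_METAL is the relevant flag.
CMAKE_ARGS="-DGGML_METAL=OFF" pip install --force-reinstall --no-cache-dir llama-cpp-python

# then retry
openllm run phi3:3.8b-ggml-q4
```

Either way, it failed with the same FOUNDATION_LIBRARY error as before.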

On the same machine, but on Windows, `openllm run phi3:3.8b-ggml-q4` fails during the same CMake build with the same error. Curiously, `openllm hello` does not recognize this model as locally runnable there, as it did on WSL, but I did not investigate why.

Please let me know if I should raise this issue in ggerganov/llama.cpp instead.

Thanks!

To reproduce

  1. Happen to have the same system as I do, I guess
  2. Run `openllm run phi3:3.8b-ggml-q4`
  3. Observe the error

Logs

gist

Environment

bentoml env:

#### Environment variable

BENTOML_DEBUG=''
BENTOML_QUIET=''
BENTOML_BUNDLE_LOCAL_BUILD=''
BENTOML_DO_NOT_TRACK=''
BENTOML_CONFIG=''
BENTOML_CONFIG_OPTIONS=''
BENTOML_PORT=''
BENTOML_HOST=''
BENTOML_API_WORKERS=''

#### System information

`bentoml`: 1.3.1
`python`: 3.10.12
`platform`: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
`uid_gid`: 1000:1000

transformers-cli env:

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


- `transformers` version: 4.44.0
- Platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
- Python version: 3.10.12
- Huggingface_hub version: 0.24.5
- Safetensors version: 0.4.4
- Accelerate version: not installed
- Accelerate config: not found
- PyTorch version (GPU?): not installed (NA)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: <fill in>

openllm -v:

openllm, 0.6.7
Python (CPython) 3.10.12

System information (Optional)

memory: 32GB
platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
architecture: x86-64
cpu: Intel Core i7-11850H @ 2.50GHz
gpu: NVIDIA RTX A2000 Laptop GPU 4GB

bojiang commented 2 months ago

Hi, thanks for contributing. For now, llama.cpp models in OpenLLM can only be deployed on macOS; there are some hard-coded, platform-specific parameters. I suggest using vLLM on Linux (and even WSL2), since it is faster and more mature for production usage.

We are adding a new feature that makes it easier to tweak parameters. Once it is released, maybe you can help us find a configuration that runs vLLM models on a 4 GB GPU.
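
As a very rough starting point, something along these lines might be worth experimenting with (a sketch only: the model ID, flag values, and launching vLLM directly rather than through OpenLLM are all illustrative assumptions, and an fp16 Phi-3 mini will not fit in 4 GB, so a quantized variant or a smaller model would likely be needed):

```bash
# Illustrative vLLM launch tuned for a small GPU; values are assumptions,
# not a tested configuration for a 4 GB card.
vllm serve microsoft/Phi-3-mini-4k-instruct \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.90 \
  --dtype half
```

Lowering `--max-model-len` reduces how much KV-cache space vLLM needs to reserve, and `--gpu-memory-utilization` caps the fraction of VRAM it will claim; those are the two knobs that matter most on small cards.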

sij1nk commented 2 months ago

Thank you! I will look into vLLM sometime later.