abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Serverless inferencing, basic chatbot style #1755

Open ericcurtin opened 5 days ago

ericcurtin commented 5 days ago

Is your feature request related to a problem? Please describe.

Serverless inferencing is already possible with llama.cpp via llama-cli; for example:

```
llama-cli -m /home/curtine/.local/share/ramalama/models/ollama/granite-code:latest --log-disable --in-prefix --in-suffix --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv
```

This can be useful in several ways: no idle processes hanging around, no single point of failure, no blocked port, each process can target a separate GPU/CPU, no need for root access to restart a daemon, etc.
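For comparison, a rough one-shot equivalent is already possible today with the existing high-level API (sketch below; the model path and prompt are placeholders). The process loads the model, answers once, and exits, so nothing is left running:

```python
# One-shot, serverless-style inference with llama-cpp-python's existing API.
# The model path is a placeholder; adjust for your setup.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/granite-code.gguf",  # placeholder path
    n_ctx=2048,
    verbose=False,
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Write a hello world in C"},  # placeholder prompt
    ],
)
print(result["choices"][0]["message"]["content"])
# Process exits here: no daemon, no port, nothing left running.
```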

Describe the solution you'd like

Serverless inferencing support: something similar to llama-cli's conversational mode, but integrated into llama-cpp-python. A rough sketch of what that could look like follows.
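As an illustration of the desired UX (not a proposed implementation), something like the following sketch, which mimics llama-cli's -cnv conversational loop on top of the existing create_chat_completion API; the model path is a placeholder:

```python
# Sketch of a llama-cli-style conversational loop built on
# llama-cpp-python's existing chat API; path and prompts are illustrative.
from llama_cpp import Llama

llm = Llama(model_path="/path/to/model.gguf", n_ctx=2048, verbose=False)
messages = [{"role": "system", "content": "You are a helpful assistant"}]

while True:
    try:
        user_input = input("> ")
    except EOFError:
        break  # Ctrl-D ends the session; the process (and the model) go away.
    messages.append({"role": "user", "content": user_input})
    reply = ""
    # Stream tokens as they are generated, like llama-cli does.
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            print(delta["content"], end="", flush=True)
            reply += delta["content"]
    print()
    messages.append({"role": "assistant", "content": reply})
```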
