Is your feature request related to a problem? Please describe.
Serverless inferencing is already possible with llama.cpp via llama-cli; for example, one can do something like this:
```
llama-cli -m /home/curtine/.local/share/ramalama/models/ollama/granite-code:latest --log-disable --in-prefix --in-suffix --no-display-prompt -p "You are a helpful assistant" -c 2048 -cnv
```
This is useful for several reasons: no idle processes hanging around, no single point of failure, no blocked port, each process can target a separate GPU/CPU, and no need for root access to restart a daemon.
Describe the solution you'd like
Serverless inferencing support: something similar to llama-cli's one-shot/conversation mode, but integrated with llama-cpp-python.
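For illustration, here is a minimal sketch of what a one-shot, serverless invocation might look like using llama-cpp-python's existing high-level API (the model path and generation parameters are illustrative, not part of this request): the process loads the model, answers a single prompt, and exits, leaving no daemon behind.

```python
import sys

from llama_cpp import Llama

# One-shot "serverless" invocation: load the model, answer, exit.
# No daemon, no blocked port, nothing left running afterwards.
llm = Llama(
    model_path="/path/to/granite-code.gguf",  # illustrative path
    n_ctx=2048,
    verbose=False,
)

# Take the prompt from argv or stdin, mirroring a CLI-style workflow.
prompt = sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": prompt},
    ],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

The ask is essentially to make this flow a first-class feature, including an interactive conversation mode like llama-cli's -cnv, rather than something every user wires up by hand.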
Describe alternatives you've considered
Additional context