# For Mac (CPU and GPU), windows (CPU and CUDA), or linux (CPU and CUDA)
llm_client="*"
This will download and build llama.cpp. See build.md for other features and backends like mistral.rs.
use Llmclient::prelude::*;
// Loads the largest quant available based on your VRAM or system memory
let llm_client = LlmClient::llama_cpp()
.mistral7b_instruct_v0_3() // Uses a preset model
.init() // Downloads model from hugging face and starts the inference interface
.await?;
Several of the most common models are available as presets. Loading from local models is also fully supported. See models.md for more information.
In addition to basic LLM inference, llm_client is primarily designed for controlled generation using step based cascade workflows. This prompting system runs pre-defined workflows that control and constrain both the overall structure of generation and individual tokens during inference. This allows the implementation of specialized workflows for specific tasks, shaping LLM outputs towards intended, reproducible outcomes.
let response: u32 = llm_client.reason().integer()
.instructions()
.set_content("Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?")
.return_primitive().await?;
// Recieve 'primitive' outputs
assert_eq!(response, 1)
This runs the reason one round cascading prompt workflow with an integer output.
This method significantly improves the reliability of LLM use cases. For example, there are test cases this repo that can be used to benchmark an LLM. There is a large increase in accuracy when comparing basic inference with a constrained outcome and a CoT style cascading prompt workflow. The decision workflow that runs N count of CoT workflows across a temperature gradient approaches 100% accuracy for the test cases.
I have a full breakdown of this in my blog post, "Step-Based Cascading Prompts: Deterministic Signals from the LLM Vibe Space."
Jump to the readme.md of the llm_client crate to find out how to use them.
Shelby Jenkins - Here or Linkedin