Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage.
[New Feature] Add new route to the dllama API for embedding models #96
```cpp
std::vector<Route> routes = {
    {
        "/v1/chat/completions",
        HttpMethod::METHOD_POST,
        std::bind(&handleCompletionsRequest, std::placeholders::_1, &api)
    },
    {
        "/v1/models",
        HttpMethod::METHOD_GET,
        std::bind(&handleModelsRequest, std::placeholders::_1)
    }
};
```
In the dllama API on the master branch we have only two routes, /v1/chat/completions and /v1/models, but some models, such as llama3:8b, have embedding functionality. Can you add a new route for /api/embeddings?
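A minimal sketch of what that could look like, extending the routes table quoted above. The /api/embeddings entry and the handleEmbeddingsRequest handler are hypothetical names introduced here for illustration, not part of the current codebase; the Route and HttpMethod types and the api object are assumed to be the same ones used by the two existing routes:

```cpp
// Hypothetical extension of the existing routes table. Only the last entry is
// new; it follows the same handler signature as handleCompletionsRequest.
std::vector<Route> routes = {
    {
        "/v1/chat/completions",
        HttpMethod::METHOD_POST,
        std::bind(&handleCompletionsRequest, std::placeholders::_1, &api)
    },
    {
        "/v1/models",
        HttpMethod::METHOD_GET,
        std::bind(&handleModelsRequest, std::placeholders::_1)
    },
    {
        // Sketch: POST /api/embeddings, dispatched like the completions route.
        // handleEmbeddingsRequest is an assumed handler, not an existing one.
        "/api/embeddings",
        HttpMethod::METHOD_POST,
        std::bind(&handleEmbeddingsRequest, std::placeholders::_1, &api)
    }
};
```

The new handler would presumably run the model's forward pass over the request text and return the resulting embedding vector as JSON, but the exact request and response shapes would need to be defined by the maintainers.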