-
Inference submodule
- [ ] ir
- [ ] manual graph construction
- [ ] automatic graph construction
- [ ] model conversion
- [ ] model interpretation
- [ ] computation graph building
- [ ] graph optimization
- [ ] memory optimization
- [ ] high-performance operators
-
# I want to evaluate the accuracy of a GGUF model using llama.cpp as the inference framework
## Use these commands:
./llama-server -m /root/ICAS_test/models/Qwen-1_8B-Q8_0.gguf
lm_eval --model gguf …
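For reference, the same evaluation can also be driven from Python. The sketch below assumes lm-evaluation-harness's `gguf` backend pointed at an already-running llama-server; the port (llama-server's default 8080) and the task name (`hellaswag`) are assumptions, not taken from the commands above:

```python
# Minimal sketch: evaluate a GGUF model served by llama-server via
# lm-evaluation-harness's "gguf" backend. Port and task are assumptions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="gguf",                                 # backend that queries a llama.cpp server
    model_args="base_url=http://localhost:8080",  # assumed llama-server address
    tasks=["hellaswag"],                          # illustration task
)
print(results["results"])
```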
-
I am currently trying to reproduce the results shown in Figure 4 - Inference Time vs Vocabulary Size from your project. I have a couple of questions regarding the methodology used for this figure:
…
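I do not know the project's actual benchmark harness, but for concreteness, a generic way to measure how per-step inference time scales with vocabulary size is to time the output (LM-head) projection alone, since that is the component whose cost grows with the vocabulary. All sizes below are made-up illustration values:

```python
# Generic sketch (not the project's code): time the LM-head projection
# for several vocabulary sizes using CUDA events.
import torch

hidden, batch = 4096, 1
x = torch.randn(batch, hidden, device="cuda", dtype=torch.float16)

for vocab in (32_000, 64_000, 128_000, 256_000):
    lm_head = torch.randn(hidden, vocab, device="cuda", dtype=torch.float16)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(3):                     # warm-up iterations
        x @ lm_head
    start.record()
    for _ in range(100):
        x @ lm_head
    end.record()
    torch.cuda.synchronize()
    print(f"vocab={vocab}: {start.elapsed_time(end) / 100:.3f} ms/step")
```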
-
1. **Prerequisite:** Make sure the LLM inference framework can be launched in SPMD style. For example, the LLM inference script can be launched by `torchrun --standalone --nproc=8 offline_i…
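A minimal SPMD-style script looks like the sketch below: every rank runs the same program and discovers its identity from the environment variables torchrun sets. The file name `spmd_demo.py` is hypothetical; launch with `torchrun --standalone --nproc_per_node=8 spmd_demo.py`:

```python
# Minimal SPMD sketch for torchrun: same program on every rank,
# identity taken from torchrun's environment variables.
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # Each rank would load its model shard and run its slice of inference here.
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready on cuda:{local_rank}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```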
-
Have you considered incorporating this work into an open source inference framework, such as vLLM?
-
# OPEA Inference Microservices Integration for LangChain
This RFC proposes the integration of OPEA inference microservices (from GenAIComps) into LangChain [extensible to other frameworks], enabli…
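On the LangChain side, the integration could look like the sketch below: a custom LLM that forwards prompts to an OPEA inference microservice over HTTP. The endpoint URL, path, and JSON payload/response shape are assumptions for illustration, not the actual GenAIComps API:

```python
# Hypothetical LangChain wrapper for an OPEA inference microservice.
# Endpoint and payload shape are assumed, not the real GenAIComps contract.
from typing import Any, List, Optional

import requests
from langchain_core.language_models.llms import LLM


class OPEAInferenceLLM(LLM):
    """Sketch of an LLM backed by an OPEA microservice."""

    endpoint: str = "http://localhost:9000/v1/chat/completions"  # assumed URL

    @property
    def _llm_type(self) -> str:
        return "opea-inference"

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs: Any) -> str:
        resp = requests.post(self.endpoint, json={"query": prompt}, timeout=60)
        resp.raise_for_status()
        return resp.json()["text"]  # assumed response field


llm = OPEAInferenceLLM()
# print(llm.invoke("What is OPEA?"))  # requires a running microservice
```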
-
### Request Description
Llama.cpp is a very popular and excellent LLM/VLM inference and deployment framework: it is implemented in pure C/C++, has no dependencies, and is cross-platform. Based on SYCL and Vu…
-
Hi.
I have a question regarding the prefetch implementation in your framework.
As I understand it, prefetching and inference should ideally run concurrently in separate CUDA streams. I noticed t…
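This is not the framework's actual code, but the pattern the question describes is typically implemented along these lines in PyTorch: host-to-device prefetch runs on a side stream while compute stays on the default stream, and CUDA events make each compute step wait only for the copy it depends on:

```python
# Generic sketch of overlapping weight prefetch (side stream) with
# compute (default stream), synchronized via CUDA events.
import torch

prefetch_stream = torch.cuda.Stream()

cpu_weights = [torch.randn(4096, 4096, pin_memory=True) for _ in range(4)]
gpu_weights = [None] * len(cpu_weights)
ready = [torch.cuda.Event() for _ in cpu_weights]
x = torch.randn(8, 4096, device="cuda")


def prefetch(i):
    # Async H2D copy on the side stream; record an event when it is queued.
    with torch.cuda.stream(prefetch_stream):
        gpu_weights[i] = cpu_weights[i].to("cuda", non_blocking=True)
        ready[i].record()


prefetch(0)
for i in range(len(cpu_weights)):
    if i + 1 < len(cpu_weights):
        prefetch(i + 1)  # overlap the next layer's copy with this layer's compute
    torch.cuda.current_stream().wait_event(ready[i])  # wait only for layer i's copy
    x = x @ gpu_weights[i]
torch.cuda.synchronize()
```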
-
How can Accelerate be used to split a model across multiple GPUs placed on different nodes for inference? If it cannot, what other frameworks can do this?
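For the single-node part of this question, Accelerate's big-model inference splits a model across all visible GPUs via a device map; the sketch below uses it through transformers, which delegates placement to Accelerate (the model id is an illustration value). As far as I know, `device_map` only spans the GPUs of one node; splitting a single model across nodes needs pipeline or tensor parallelism from frameworks such as DeepSpeed inference or vLLM:

```python
# Sketch: shard a model across the GPUs of one node with device_map="auto"
# (placement handled by Accelerate under the hood).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustration value

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # Accelerate spreads layers over all visible GPUs
    torch_dtype=torch.float16,
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
```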
-
### Motivation
1. The qwen2vl model is at the SOTA level among open-source models.
2. lmdeploy is an excellent inference framework.
3. So it is important for turbomind to support qwen2vl.
### Related resources
_No re…