-
- It should automatically detect the best device to run on.
- We should require zero manual configuration from the user; llama.cpp, for example, requires specifying the device by default (see the sketch below).
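As a minimal sketch of what zero-configuration device selection could look like (the `pick_device` helper and the CUDA > MPS > CPU priority order are assumptions, not existing project code):

```python
import torch

def pick_device() -> torch.device:
    """Return the best available device without any user configuration (assumed priority order)."""
    if torch.cuda.is_available():           # discrete NVIDIA GPU
        return torch.device("cuda")
    if torch.backends.mps.is_available():   # Apple Silicon GPU
        return torch.device("mps")
    return torch.device("cpu")              # safe fallback

device = pick_device()
model = torch.nn.Linear(16, 16).to(device)  # caller never names a device explicitly
```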
-
### Your current environment
irrelevant
### How would you like to use vllm
What arguments would maximize overall throughput for large-batch offline inference? More specifically, I…
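For context, a minimal offline-batch sketch of the knobs usually discussed for throughput; the model name and the specific values here are illustrative assumptions, not recommended settings:

```python
from vllm import LLM, SamplingParams

# Illustrative values only; tune per model and GPU.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed example model
    gpu_memory_utilization=0.90,   # fraction of GPU memory given to weights + KV cache
    max_num_seqs=256,              # upper bound on sequences scheduled per step
)
params = SamplingParams(temperature=0.0, max_tokens=256)

prompts = ["Summarize the following text: ..."] * 1024
outputs = llm.generate(prompts, params)  # vLLM batches and schedules internally
```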
-
We're planning on submitting a paper describing the [Y0 Causal Inference Engine](https://github.com/y0-causal-inference/y0) and have two clarifying questions:
1. Does JOSS have a notion of "senior"…
-
I've noticed that the logs currently record the sampling parameters alongside the prompt. What I really need is the ability to log a trace_id for each request. My use case involves scena…
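A generic way to get this today, as a sketch independent of any particular engine (the `trace_id_var` name and where it gets set are assumptions about the surrounding request-handling code):

```python
import contextvars
import logging

# Hypothetical per-request context variable; set it wherever a request enters the system.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Inject the current trace_id into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)

# At request entry:
trace_id_var.set("req-1234")
logging.getLogger().warning("received prompt")  # log line now carries req-1234
```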
-
ERROR: [Torch-TensorRT] - Unsupported operator: aten::to.dtype_layout(Tensor(a) self, *, ScalarType? dtype=None, Layout? layout=None, Device? device=None, bool? pin_memory=None, bool non_blocking=Fals…
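A common workaround when a single op such as `aten::to.dtype_layout` is unsupported is to perform the dtype/device cast eagerly, outside the region handed to the converter, so the traced graph never contains it. A minimal sketch of that restructuring (the wrapper module and shapes are made up for illustration):

```python
import torch

class Wrapped(torch.nn.Module):
    """Keep dtype/device casts outside the traced region so the converter never sees aten::to."""
    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.inner(x)  # no .to(...) inside the compiled graph

model = Wrapped(torch.nn.Linear(8, 8)).eval().cuda()
x = torch.randn(1, 8).to(device="cuda", dtype=torch.float32)  # cast done eagerly, before tracing
scripted = torch.jit.trace(model, (x,))  # graph passed to Torch-TensorRT contains no aten::to
```

Torch-TensorRT also offers partial compilation, where unsupported ops fall back to PyTorch; whether that path applies here depends on the version and frontend in use.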
-
await self.inference_engine.infer_tensor(request_id, shard, tensor, inference_state=inference_state)
ValueError: Shapes (1,8,4,60,119) and (60,60) cannot be broadcast
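For reference, NumPy/PyTorch-style broadcasting aligns shapes from the trailing dimension, so (1, 8, 4, 60, 119) and (60, 60) clash on the last axis (119 vs. 60). A tiny reproduction with the shapes copied from the error (data is just zeros):

```python
import numpy as np

a = np.zeros((1, 8, 4, 60, 119))
b = np.zeros((60, 60))

try:
    a + b  # trailing dims 119 vs 60 differ and neither is 1 -> broadcast fails
except ValueError as e:
    print(e)

a + np.zeros((60, 119))  # ok: trailing dims match
a + np.zeros((60, 1))    # ok: size-1 axis broadcasts across 119
```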
-
Unable to successfully perform inference on a Google Pixel 4 device; the error message is as follows:
```log
17:04:57.271 Remote...onImpl W requestCursorAnchorInfo on inactive InputConnect…
```
-
### System Info
- TensorRT-LLM main branch
### Who can help?
@kaiyux
### Information
- [x] The official example scripts
- [ ] My own modified scripts
### Tasks
- [x] An officially supported ta…
-
python -m awq.entry --model_path awq_cache/llama3-8b-w4-g128.pt \
--w_bit 4 --q_group_size 128 \
--run_awq --dump_awq awq_cache/llama3-8b-w4-g128.pt
Traceback (most recent call last):
F…
-
Hi! I know that PyTorch can use MPS to accelerate inference on Apple computers, but ONNX also provides the ability to use the ANE (Apple Neural Engine).
Have you tried converting BS-Roformer or Demucs to…
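For context, a minimal sketch of running an already-exported ONNX model through ONNX Runtime's Core ML execution provider, which is the path that can schedule work onto the ANE; the model path, input name, and shape are placeholders, and whether the ANE is actually used depends on the operators in the exported graph:

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder for an exported BS-Roformer/Demucs graph.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],  # CPU as fallback
)

# Input name/shape are illustrative; query session.get_inputs() for the real ones.
input_name = session.get_inputs()[0].name
audio = np.random.randn(1, 2, 44100).astype(np.float32)
outputs = session.run(None, {input_name: audio})
```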