-
Now I'm using CUTLASS in my project. I found that some cases have constraints on the layout, such as requiring input matrix A and output matrix C to be row-major. These kinds of assumptions limit the feasibi…
-
## Description
Slower cleanup methods do not run to completion when the kernel is restarted.
## Reproduce
1. Create a new IPython notebook in Jupyter Lab.
2. Create and ex…
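The truncated steps above can be simulated outside Jupyter. The sketch below is plain CPython, not an actual notebook kernel, and the `slow_cleanup` name and two-second delay are made-up stand-ins; it registers a slow `atexit` handler, which is the kind of teardown a kernel restart can cut short if it kills the process before the handler finishes.

```python
import subprocess
import sys
import textwrap

# Hypothetical reproduction: a "slow" cleanup registered with atexit.
# On a normal interpreter exit the handler runs to completion; a hard
# kill (like a kernel restart with a short shutdown timeout) would
# terminate the process before "cleanup done" is printed.
script = textwrap.dedent("""
    import atexit, time

    def slow_cleanup():
        time.sleep(2)          # simulate slow resource teardown
        print("cleanup done")  # only reached if given enough time

    atexit.register(slow_cleanup)
    print("cell finished")
""")

# Normal exit: both lines appear, in order.
out = subprocess.run([sys.executable, "-c", script],
                     capture_output=True, text=True, timeout=10)
print(out.stdout)
```

Running the same script but terminating it with a signal before the sleep elapses would show the opposite: `cell finished` without `cleanup done`.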
-
To address the MTU problem listed in #1853 for 4.3 and 4.4 kernels, we could allow a user to pass a netdev interface name. That interface would be used as the VTEP (`lowerdev`) for the VXLAN netdev we create…
-
### Feature request
Integrate the Liger (LinkedIn GPU Efficient Runtime) Kernel into the HuggingFace Trainer, so users can decide whether to enable the kernel with a simple flag.
### Motivation
Liger (Linkedi…
-
For the discrete kernels, looking at the temporal plots is quite meaningful, since each kernel is added just once.
However, for the continuous kernels, the kernels get convolved against some long …
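A small numerical sketch of that difference (the impulse positions and the exponential response are made-up stand-ins, not the actual kernels from this model): a discrete kernel shows up as a single spike per event, while a convolved kernel smears each event across many time steps, which is why its temporal plot is harder to read kernel-by-kernel.

```python
import numpy as np

# Discrete kernels: each one contributes a single impulse in time.
t = np.arange(100)
impulses = np.zeros(100)
impulses[[10, 40, 70]] = 1.0  # three events, one spike each

# Continuous kernels: each event is convolved against a longer
# response (here an assumed exponential decay over 30 steps).
response = np.exp(-np.arange(30) / 5.0)
continuous = np.convolve(impulses, response)[:100]

# The impulse train stays sparse; the convolved trace is nonzero
# over extended windows around every event.
print(int((impulses > 0).sum()))       # 3 nonzero samples
print(int((continuous > 1e-3).sum()))  # 90 nonzero samples
```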
-
Hello,
I have been using vLLM to serve CodeLlama 2 13B with only 2 NVIDIA L4 GPUs. The engine is set up as follows:
python -m vllm.entrypoints.openai.api_server --model="codellama/CodeLlama-13b-Instruct…
-
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Upstream in DataFusion, there is a common pattern where we have multiple input `Record…
-
I compared two ways to launch the server.
The model is vicuna-7b, and the GPUs are 2 \* A30.
The first way is:
```
python -m vllm.entrypoints.openai.api_server \
--model /data/models/vicuna-…
-
I am using TensorRT-LLM 0.8.0 (I added MoE support following Llama's implementation). We serve models with trtllm_backend (Docker image triton-trtllm-24.02).
[qwen2-moe-57B-A14B](https://huggingface.co/Qwe…
-
Hello,
I am having an issue where I run an initial computation: basic matrix multiplication using TFJS. On the first run, the computation is extremely slow, taking 800 ms. On the second run, …