-
Hey :wave: This is amazing work, thanks!
I'm trying to use the example code on an A100 but don't notice any speedups. Is this expected? I do see a 20-30% speedup on a 4090, so maybe it's due to dif…
-
### Proposal to improve performance
TL;DR: Move speculative decoding's `scoring` input preparation onto the GPU so that a CPU synchronization can be skipped.
Currently, speculative decoding copies proposal token…
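A minimal sketch of the idea, assuming a PyTorch-style setup (the tensor names `input_ids` and `proposal_token_ids` are illustrative, not vLLM's actual API): copying proposal tokens to the host (e.g. via `.tolist()`) forces an implicit device synchronization, whereas assembling the scoring inputs directly on the device avoids the round-trip.

```python
import torch

# Illustrative shapes; falls back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

input_ids = torch.randint(0, 32000, (4, 128), device=device)
proposal_token_ids = torch.randint(0, 32000, (4, 5), device=device)

# Host path (what the issue describes): .tolist() blocks until the
# device stream is drained, i.e. an implicit CPU synchronization.
# proposals = proposal_token_ids.tolist()

# Device path: build the scoring inputs with on-device tensor ops,
# so no host copy and no synchronization is needed.
scoring_input_ids = torch.cat([input_ids, proposal_token_ids], dim=1)
```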
-
### Comment:
A conda-forge environment containing nothing but PyTorch for CUDA currently weighs in at 7.2 GB, which is rather on the heavy side.
Looking at potential for slimm…
-
For the multi-GPU solution, we still have challenges with first-token latency. The breakdown data was shared offline.
Please help add more optimization features (e.g. SDPA / FlashAttention) to improve…
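One of the requested optimizations can be sketched with PyTorch's built-in `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a FlashAttention kernel on supported GPUs; the shapes below are illustrative, not taken from the project in question.

```python
import torch
import torch.nn.functional as F

# Illustrative attention shapes: (batch, heads, seq_len, head_dim).
batch, heads, seq, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# is_causal=True applies the autoregressive mask used for decoding;
# the backend (Flash, memory-efficient, or math) is chosen automatically.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```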
-
### 🚀 The feature, motivation and pitch
Fuyou Training Framework Integration for PyTorch
Description:
Integrate the Fuyou training framework into PyTorch to enable efficient fine-tuning of larg…
-
Hi,
I am looking for a library to solve batches of trajectory optimization problems (same problem, different initial states) on accelerators, and I've found this library, which looks great! I have…
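The "same problem, different initial states" pattern described above can be sketched with `torch.func.vmap`, which vectorizes a single-problem solver over a batch of initial states; the dynamics below are a placeholder, not the library's API.

```python
import torch
from torch.func import vmap

def rollout(x0, steps=10, dt=0.1):
    # Stand-in dynamics: a simple damped integrator replacing the
    # real trajectory-optimization problem.
    x = x0
    for _ in range(steps):
        x = x + dt * (-0.5 * x)
    return x

# 64 instances of the same problem, each with a different initial state.
initial_states = torch.randn(64, 3)
final_states = vmap(rollout)(initial_states)
```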
-
### Start Date
9/3/2024
### Implementation PR
_No response_
### Reference Issues
_No response_
### Summary
When using vLLM to optimally utilize GPU space for faste…
-
**Project Abstract**
This document proposes a matrix calculator web application that supports a wide variety of matrices and matrix operations (multiplication, Gaussian elimination, inversion, decomp…
-
## Project info
**Title:**
Optimization and GPU porting of information flow implementation
**Project lead and collaborators:**
Etienne Combrisson & Ruggero Basanisi
**Image of the proj…
-
### System Info
Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
tensorrt_llm 0.13.0.dev2024081300
tritonserver 2.48.0
triton image: 24.07
CUDA 12.5
### Wh…