-
- https://arxiv.org/pdf/2404.16710
- Diagram
![Screenshot 2024-10-30 at 9 29 59 PM](https://github.com/user-attachments/assets/425cf827-0a2d-4ac4-9884-1a454e0e6b04)
-
I'd like to explore the best approach for managing multi-client connections in both single and multi-GPU environments.
Often, GPUs are underutilized by a single client, especially when smaller mode…
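One common way to let multiple clients share a single underutilized GPU is request micro-batching: clients enqueue inputs, and a single worker drains the queue and runs one batched model call. The sketch below is a toy illustration of that idea in pure Python, assuming a hypothetical `model_fn` that accepts a batch; none of the names are an existing API.

```python
import threading
import queue
import time

class MicroBatcher:
    """Toy request batcher: many clients call submit(), one worker
    runs their inputs through the (stand-in) model as a single batch.
    Illustrative only; not an existing serving API."""

    def __init__(self, model_fn, max_batch=8, max_wait_s=0.01):
        self.model_fn = model_fn          # batched callable: list -> list
        self.max_batch = max_batch        # largest batch per model call
        self.max_wait_s = max_wait_s      # how long to wait to fill a batch
        self.requests = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, x):
        """Called by a client thread; blocks until its result is ready."""
        slot = {"input": x, "event": threading.Event()}
        self.requests.put(slot)
        slot["event"].wait()
        return slot["output"]

    def _worker(self):
        while True:
            batch = [self.requests.get()]  # block for the first request
            deadline = time.monotonic() + self.max_wait_s
            while len(batch) < self.max_batch:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self.requests.get(timeout=remaining))
                except queue.Empty:
                    break
            # One batched call instead of one call per client.
            outputs = self.model_fn([s["input"] for s in batch])
            for slot, out in zip(batch, outputs):
                slot["output"] = out
                slot["event"].set()
```

In a multi-GPU setting the same pattern applies per device, with a router assigning clients to per-GPU batchers.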
-
# TensorRT Model Optimizer - Product Roadmap
[TensorRT Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer) (ModelOpt)’s north star is to be the best-in-class model optimization toolki…
-
# Pruning Convolutional Neural Networks for Resource Efficient Inference #
- Authors: Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, Jan Kautz
- Origin: https://arxiv.org/abs/1611.06440
-…
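The paper's central idea is to rank feature maps by a first-order Taylor criterion, the absolute value of the mean of (cost gradient × activation) over a map, and prune the lowest-ranked maps first. A minimal sketch of that criterion, using plain Python lists as stand-ins for tensors:

```python
def taylor_criterion(activation, gradient):
    """First-order Taylor pruning criterion from Molchanov et al.:
    |mean over positions of (dC/dz * z)| for one feature map.
    `activation` and `gradient` are flat lists of the map's values
    and the cost gradient w.r.t. those values (toy tensor stand-ins)."""
    m = len(activation)
    return abs(sum(a * g for a, g in zip(activation, gradient)) / m)

def rank_feature_maps(acts, grads):
    """Return feature-map indices sorted by increasing saliency;
    the lowest-ranked maps are candidates for pruning."""
    scores = [taylor_criterion(a, g) for a, g in zip(acts, grads)]
    return sorted(range(len(scores)), key=lambda i: scores[i])
```

In the paper this score is computed from the activations and gradients already available during backprop, so ranking adds almost no cost to training.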
-
Error occurred when executing Yoloworld_ESAM_Zho:
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
RuntimeError: invalid vector sub…
-
## vLLM Virtual Open Office Hours
We enjoyed seeing everyone at the previous office hours and got great feedback. These office hours are a ~bi-weekly live event where you come to learn more about t…
mgoin updated 2 weeks ago
-
Add Stan PPL integration to use Stan models with Blackjax inference algorithms
With the [BridgeStan](https://roualdes.github.io/bridgestan/latest/) library, we can efficiently access log density an…
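The integration idea is that the sampler only needs a log-density callable (and its gradient), which is exactly the surface BridgeStan exposes for a compiled Stan model. The sketch below illustrates that separation with a hand-written Gaussian log density standing in for BridgeStan and a tiny random-walk Metropolis step standing in for a Blackjax kernel; none of this is the actual BridgeStan or Blackjax API.

```python
import math
import random

def gaussian_logdensity(theta):
    """Stand-in for a model's log_density(): an unnormalized
    standard-normal log density. A real integration would call
    into the compiled Stan model here instead."""
    return -0.5 * theta * theta

def random_walk_metropolis(logdensity, init, steps, scale=1.0, seed=0):
    """Toy Metropolis sampler standing in for an inference kernel.
    It touches the model only through the logdensity callable."""
    rng = random.Random(seed)
    theta, lp = init, logdensity(init)
    samples = []
    for _ in range(steps):
        prop = theta + rng.gauss(0.0, scale)
        lp_prop = logdensity(prop)
        if math.log(rng.random()) < lp_prop - lp:  # accept/reject
            theta, lp = prop, lp_prop
        samples.append(theta)
    return samples
```

Swapping the toy density for a BridgeStan-backed one (and the toy kernel for a gradient-based Blackjax algorithm) keeps exactly this shape, which is what makes the proposed integration lightweight.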
-
Hi team,
I'm running inference on a g5.24xlarge GPU instance. The data is currently structured in a Pandas dataframe, and I use the Pandas `apply` method to apply the predict_entities function. When the df g…
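Row-wise `df.apply` invokes the model once per row, which tends to leave the GPU idle between calls; a usual fix is to pull the column out, run the model once per chunk, and write the results back. A minimal sketch, where `predict_entities_batch` is a hypothetical batched counterpart of the per-row function (here faked with `str.upper`):

```python
def predict_entities_batch(texts):
    """Hypothetical batched stand-in for the per-row predict_entities():
    one call handles a whole list, which is where GPU inference wins."""
    return [text.upper() for text in texts]  # placeholder "model"

def batched_predict(texts, batch_size=32):
    """Run the model once per chunk instead of once per row
    (as df.apply would), then flatten results back in input order."""
    out = []
    for i in range(0, len(texts), batch_size):
        out.extend(predict_entities_batch(texts[i:i + batch_size]))
    return out
```

Usage against the dataframe would look like `df["entities"] = batched_predict(df["text"].tolist())`, with `batch_size` tuned to what fits in GPU memory.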
-
Config:
Windows 10 with RTX4090
All requirements incl. flash-attn build - done!
Server:
```
(venv) D:\PythonProjects\hertz-dev>python inference_server.py
Using device: cuda
Loaded tokeniz…
```
-
## Progress
- [ ] Integrate CPU executor to support the basic model inference (BF16/FP32) without TP.
- #3634
- #3824
- #4113
- #4971
- #5452
- #5446
- [ ] Support FP16 mo…