-
- [x] Formalize the model
- [ ] Lowering separates `vertical regions` into `MultiStages`
- [ ] Lowering separates `statements` into `Stages`
- [ ] Rework Stage Splitting Pass (now pure optimisation)
…
-
### 🚀 The feature, motivation and pitch
It is common to have a scenario where folks want to deploy multiple vLLM instances on a single machine because the machine has several GPUs (commonly 8). …
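One common way to do this today is to pin each server process to a subset of the GPUs and give it its own port; a minimal sketch, assuming the `vllm serve` CLI is available (model name and port numbers are illustrative):

```shell
# Pin each vLLM server to one GPU and a distinct port (illustrative values).
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 &
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8001 &
# ...repeat for GPUs 2-7, then load-balance requests across the ports.
```

This works, but each instance is managed by hand, which is part of the motivation for first-class multi-instance support.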
-
### Problem description
I tested with a fixed seed. To confirm the seed really was fixed, I first ran the multi-GPU script repeatedly and verified that the generated image was identical on every run.
Under that condition, the images generated with different GPU counts are:
| run | image |
|-----|-------|
| flux_result_dp1_cfg1_ulysses1_…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N…
-
#### What happened
When using LiteLLM with Redis caching enabled and making parallel calls, incorrect trace_ids are being sent to Langfuse, despite langfuse_context.get_current_trace_id() returning…
-
### System Info
```shell
using Huggingface AMI from AWS marketplace with Ubuntu 22.04
optimum-neuron 0.0.25
transformers 4.45.2
peft 0.13.0
trl 0.11.4
accelerate 0.29.2
torch 2.1.2
```
…
-
**Describe the bug**
I am trying to convert the default `mamba.nemo` file (I converted the .pt [from huggingface](https://huggingface.co/nvidia/mamba2-8b-3t-4k/tree/main) to .nemo) to have `tensor_parall…
-
### 🚀 The feature, motivation and pitch
For transformer architecture (for example https://github.com/pytorch-labs/gpt-fast/blob/main/model.py#L195-L211) it tends to be most performant to merge the qk…
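The fused projection is algebraically identical to the three separate ones: stacking the q, k, v weight matrices turns three small GEMMs into one larger GEMM, and the output is split afterwards. A minimal numpy sketch (dimensions and names are illustrative):

```python
import numpy as np

# Illustrative dimensions.
d_model, n_tokens = 16, 4
rng = np.random.default_rng(0)

# Three separate projection matrices, as in an unfused attention block.
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

x = rng.standard_normal((n_tokens, d_model))

# Unfused: three matmuls.
q, k, v = x @ W_q.T, x @ W_k.T, x @ W_v.T

# Fused: stack the weights once, run a single matmul, then split the output.
W_qkv = np.concatenate([W_q, W_k, W_v], axis=0)  # shape (3*d_model, d_model)
qkv = x @ W_qkv.T                                # one larger GEMM
q2, k2, v2 = np.split(qkv, 3, axis=-1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The single large GEMM tends to saturate the hardware better than three small ones, which is why merged qkv (as in the linked gpt-fast model) is usually faster.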
-
**Why is it that when using a quantized model for inference, the improvement in TTFT is not obvious, while overall inference throughput improves a lot? At the same time, the inference efficiency…
-
First of all, thank you for your amazing work on the nnScaler project. It has been incredibly inspiring, and I’ve been learning from and using the contents of this repository in my own work.
I have a fe…