-
Using our launcher and the latest pull of our pretrain repo, you can run a Llama3 70B model as follows. Thanks to @AleHD for getting activation recompute and async working.
```
(export DP=1 PP=4 BA…
-
Dear torchtitan team, I have a question regarding gradient norm clipping when using pipeline parallelism (PP), potentially combined with `FSDP/DP/TP`.
For simplicity, let's assume each process/GPU h…
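To make the question concrete, here is a toy pure-Python sketch of the math involved (this is an illustration of my understanding, not torchtitan's actual code): with PP, each rank owns a disjoint shard of the parameters, so the global gradient norm has to combine the local norms before anyone clips. In real code the sum of squared local norms would be an `all_reduce` over the PP (and DP/TP) process groups; here plain lists stand in for the shards.

```python
import math

def local_sq_norm(grads):
    """Sum of squared gradient elements owned by one rank."""
    return sum(g * g for g in grads)

def clip_all_ranks(per_rank_grads, max_norm):
    """Clip so the *global* L2 norm (across all shards) is at most max_norm."""
    # Stand-in for dist.all_reduce(total_sq, op=SUM) over the model-parallel group.
    total_sq = sum(local_sq_norm(g) for g in per_rank_grads)
    global_norm = math.sqrt(total_sq)
    coef = min(1.0, max_norm / (global_norm + 1e-6))
    # Every rank scales its local shard by the same coefficient, so the
    # clipped gradient is identical to the single-device result.
    return [[g * coef for g in grads] for grads in per_rank_grads], global_norm

# Two "ranks", each holding part of the gradient; the global norm is 5.0.
clipped, norm = clip_all_ranks([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
```

The point of the sketch is that clipping each rank's shard against its own local norm would give a different (wrong) result than clipping against the combined norm.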
-
We should get major efficiencies and speedup by running multiple "data" pathways through the same synaptic weights and network architecture. In effect, it is like "shared weights" for multiple copies…
-
Running model forwards within a process seems to get stuck. I tried setting `TOKENIZERS_PARALLELISM` to both `true` and `false`, but unfortunately neither helped 🥲
### System Info
`transformers-cli…
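One thing I have seen matter in cases like this (an assumption about the cause, not a confirmed diagnosis): `TOKENIZERS_PARALLELISM` is read when the Rust tokenizer first spawns its thread pool, so setting it after the first tokenizer call has no effect, and forking the process after that point can still deadlock regardless of the flag. A minimal sketch of the ordering that matters:

```python
import os

# Must be set before the first tokenizer call (ideally before importing
# transformers); flipping it afterwards is ignored by the Rust thread pool.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Only now import and use the tokenizer (commented out to stay self-contained):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# ids = tok("hello world")["input_ids"]

# If you spawn worker processes, prefer the "spawn" start method over "fork";
# forking a parent that already has native tokenizer threads is a common
# source of hangs even with parallelism disabled.
```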
-
I don't have 4X80G GPUs. In a 4X40G environment, Qwen2-VL-72B-Instruct doesn't fit in VRAM, so I want to deploy it with model pipelining across multiple nodes, but vLLM doesn't support this. Is there a chance this will be supported later?
```bash
python3 -m vllm.entrypoints.openai.api_server --port 8000 --model /llm_weights/Qwen2-VL-72B-Ins…
-
**What is your question?**
Hello!
I’ve been exploring the Cutlass examples for GEMM and Convolution and noticed the use of double buffering.
https://developer.nvidia.com/blog/cutlass-linear-algebra-…
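To check my understanding of the pattern (a toy Python simulation of the scheduling idea, not CUTLASS code): the mainloop keeps two staging buffers and issues the load of tile k+1 before consuming the buffer holding tile k, so on a GPU the global-memory latency is hidden behind the MMA work.

```python
def gemm_mainloop_double_buffered(tiles):
    """tiles: list of (a_tile, b_tile) pairs; returns the accumulated sum of
    products plus a log showing how loads and computes interleave."""
    log = []
    buffers = [None, None]          # two shared-memory-like staging buffers
    buffers[0] = tiles[0]           # prologue: prefetch the first tile
    log.append("load tile 0 -> buf 0")
    acc = 0
    for k in range(len(tiles)):
        cur, nxt = k % 2, (k + 1) % 2
        if k + 1 < len(tiles):
            # Issue the next load *before* consuming the current buffer;
            # this is the overlap that double buffering buys.
            buffers[nxt] = tiles[k + 1]
            log.append(f"load tile {k + 1} -> buf {nxt}")
        a, b = buffers[cur]
        acc += a * b                # stand-in for the MMA on the staged tile
        log.append(f"compute tile {k} from buf {cur}")
    return acc, log

acc, log = gemm_mainloop_double_buffered([(1, 2), (3, 4), (5, 6)])
# acc == 1*2 + 3*4 + 5*6 == 44; each "load tile k+1" appears before "compute tile k".
```

Is this roughly the structure the examples implement, with the two buffers living in shared memory and the loads issued as cp.async / vectorized global loads?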
-
**Is your feature request related to a problem? Please describe.**
I’m facing an issue when deploying large models in Kubernetes, especially when the pod’s ephemeral storage is limited. Triton Infere…
-
### 🚀 The feature, motivation and pitch
I am trying to run a 70B model on a node with 3XA100-80Gi.
2XA100-80Gi does not provide enough VRAM to run the model, and when I try to run vLLM with tensor p…
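For context, my understanding (an assumption on my part) is that tensor parallelism shards the attention heads across ranks, so the TP size must divide the model's head count evenly, which is why an odd GPU count like 3 fails. A minimal sketch of that constraint:

```python
def valid_tp_size(num_attention_heads, tp_size):
    # Each TP rank must own the same whole number of attention heads.
    return num_attention_heads % tp_size == 0

# Llama-style 70B models use 64 query heads (assumed here for illustration):
assert valid_tp_size(64, 2)      # 2-way TP shards 64 heads evenly
assert not valid_tp_size(64, 3)  # 3 GPUs cannot evenly shard 64 heads
```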
-
### Motivation.
As vLLM supports more and more models and features, they require different attention, scheduler, executor, and input/output processor implementations. These modules are becoming increasingly com…
-
### Issue type
Feature Request
### Have you reproduced the bug with TensorFlow Nightly?
Yes
### Source
binary
### TensorFlow version
tf 2.15
### Custom code
No
### OS pla…