-
When using tensor parallelism, the compute utilization of one of the GPUs drops to 0% while the other GPU's rises to 100%; the request does not respond, and the service cannot handle new …
-
Hi, I want to run one LLM model across multiple machines.
Within a single node, I want to use tensor parallelism to speed things up.
Across multiple nodes, I want to use pipeline parallelism.
Is this supported? If s…
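For illustration, here is a minimal sketch of that layout, assuming vLLM (the post does not name a framework); the model name and the parallel sizes are placeholders:

```python
# Hedged sketch, assuming vLLM: tensor parallel inside each node,
# pipeline parallel across nodes. Model and sizes are placeholders.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # placeholder model
    tensor_parallel_size=8,             # GPUs per node (intra-node TP)
    pipeline_parallel_size=2,           # number of nodes (inter-node PP)
)
print(llm.generate("Hello")[0].outputs[0].text)
```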
-
So basically I am trying to train Llama / Mistral.
I run the following command:
```bash
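# NEURON_RT_LOG_LEVEL=info enables verbose Neuron runtime logging;
# XLA_USE_BF16=1 tells torch-xla to run fp32 computations in bfloat16.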
NEURON_RT_LOG_LEVEL=info XLA_USE_BF16=1 ./train_mistral.sh
```
Here is the link to [train_mistral.sh](ht…
-
Scaling models requires that they be trained in data-parallel, pipeline-parallel, or tensor-parallel regimes. The last two, both being forms of "model parallelism", require a single model to be shared across GPUs. Thi…
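As a minimal illustration of the tensor-parallel case (a sketch with hypothetical names, not taken from the text above): each shard holds one column slice of a linear layer's weight and computes its own slice of the output, which an all-gather then reassembles.

```python
import torch

# Column-wise tensor parallelism for one linear layer (illustrative).
# In a real setup each shard lives on a different GPU; here they are
# plain tensors so the sketch runs anywhere.
def column_parallel_linear(x, weight_shards):
    # Each shard computes its own slice of the output features.
    partial = [x @ w.t() for w in weight_shards]
    # An all-gather (here: a simple concat) reassembles the full output.
    return torch.cat(partial, dim=-1)

w = torch.randn(16, 8)       # full weight: 16 output features, 8 inputs
shards = w.chunk(4, dim=0)   # 4 shards of 4 output features each
x = torch.randn(2, 8)
assert torch.allclose(column_parallel_linear(x, shards), x @ w.t())
```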
-
I have an inf2.24xlarge and I am running the Llama-2 inference example. All the packages installed are the latest versions.
Everything worked fine until the step where I load the model with tp_degree = 24, and it faile…
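For context, a minimal sketch of what that load step typically looks like in the transformers-neuronx Llama example (an assumption; the post does not show the code, and the checkpoint path below is a placeholder):

```python
from transformers_neuronx.llama.model import LlamaForSampling

# Shard the model across NeuronCores; tp_degree cannot exceed the
# NeuronCores available on the instance and must divide the head count.
model = LlamaForSampling.from_pretrained(
    "./llama-2-checkpoint",  # placeholder path
    tp_degree=24,
    amp="f16",
)
model.to_neuron()  # compile and load weights onto the Neuron devices
```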
-
Hello, I have some questions about using transformer_engine. There are some parallel operators in my model, such as RowParallelLinear and ColumnParallelLinear from flash_attn. How can I replace these o…
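For reference, a sketch of one possible mapping, assuming Transformer Engine's te.Linear with parallel_mode; whether this matches the exact semantics of the flash_attn operators is an assumption, and the feature sizes are placeholders:

```python
import torch.distributed as dist
import transformer_engine.pytorch as te

# Assumes torch.distributed is already initialized, one rank per GPU.
tp_group = dist.new_group()
tp_size = dist.get_world_size(tp_group)

# Possible stand-ins for ColumnParallelLinear / RowParallelLinear:
col_linear = te.Linear(4096, 16384, parallel_mode="column",
                       tp_group=tp_group, tp_size=tp_size)
row_linear = te.Linear(16384, 4096, parallel_mode="row",
                       tp_group=tp_group, tp_size=tp_size)
```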
-
WIP project roadmap for LoRAX. We'll continue to update this over time.
# v0.10
- [ ] Speculative decoding adapters
- [ ] AQLM
# v0.11
- [ ] Prefix caching
- [ ] BERT support
- [ ] Embe…
-
Using FastChat to serve the Baichuan2 LLM through the OpenAI-compatible API on two V100 32G GPUs. Inference is slower than running the model on a single GPU: nearly 3 tokens in 5 seconds.
```
python3 -m fastchat.serv…
-
Status: Draft
Updated: 09/10/2024
# Objective
In this doc we’ll talk about how different optimization techniques are structured in torchao and how to contribute to torchao.
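To ground the discussion, here is one example of applying a torchao technique end to end (a minimal sketch using the torchao.quantization API; the model is a placeholder):

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# A placeholder model; quantize_ swaps its linear weights in place.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).eval()
quantize_(model, int8_weight_only())

out = model(torch.randn(1, 1024))
```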
# torchao Stack Ove…
-
Arraymancer has become a key piece of the Nim ecosystem. Unfortunately, I do not have the time to develop it further, for several reasons:
- family: the birth of a family member and the death of hobby time.
- competin…