-
repro:
```
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --float8.enable_fsdp_float8_all_gather --float8.scaling_type_weight "delayed" --metrics.lo…
-
When I run this example, which [runs on multiple GPUs using Distributed Data Parallel (DDP) training](https://docs.lightly.ai/self-supervised-learning/examples/simclr.html), on AWS SageMaker with 4 GPUs and …
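For context, a minimal sketch of the DDP setup such an example typically relies on when launched with `torchrun` on a single 4-GPU instance; the script name and model below are placeholders, not the SimCLR example's actual code:

```python
# Minimal DDP skeleton, assuming a launch like:
#   torchrun --nproc_per_node=4 train.py
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... build the dataset with a DistributedSampler and run the training loop ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```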
-
# Summary
Adjust the values around incontinence supplies in the annual survey to include kits.
# Why?
Saves time during annual reporting. Requested by NDBN.
# Details
This is similar to the recent c…
-
### Gloo Edge Product
Open Source
### Gloo Edge Version
v1.15
### Is your feature request related to a problem? Please describe.
Add a literalsForTags field (and other xxxForTags fields) to [rout…
-
**Current Behavior**
Right now, performance tests using NightHawk are limited to single-instance load generation. This limits the amount of traffic that can be generated to the output of the single …
-
### 🐛 Describe the bug
This simple code:
```python
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl")
group = None
def all_gather(input_: torch.Tensor, …
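# For illustration only: a minimal, hypothetical sketch of a typical
# all-gather helper built on torch.distributed -- not necessarily the
# reporter's exact code above.
def _all_gather_sketch(input_: torch.Tensor, group=None) -> torch.Tensor:
    """Gather same-shaped tensors from every rank and concatenate along dim 0."""
    world_size = dist.get_world_size(group=group)
    if world_size == 1:
        return input_
    gathered = [torch.empty_like(input_) for _ in range(world_size)]
    dist.all_gather(gathered, input_.contiguous(), group=group)
    return torch.cat(gathered, dim=0)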
-
Llama.cpp now supports distributing inference across multiple devices to boost speed; this would be a great addition to Ollama.
https://github.com/ggerganov/llama.cpp/tree/master/examples/rpc
https://www.re…
-
### System Info
- Python 3.10
- torch==2.4.1 and torch==2.5.1+cu121
- bitsandbytes==0.44.1
- llama-recipes 0.4.0.post1 and 0.4.0
### Reproduction
While running:
```bash
torchrun --nnodes…
-
Hello,
I was wondering if there is a manual (with screenshots) available for the application.
Also, under which license is the software distributed?
Thank you in advance for your answer.
-
Hi, I'm using my own dataset to reproduce your work. I noticed you used Slurm for training, but I can only use distributed training with dist_train.sh to train my own project. But there …