-
### System Info
```Shell
accelerate 0.31.0
Ubuntu 22.04 (WSL)
python=3.10.14
```
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] …
-
I was trying to fine-tune Meta-Llama-3-8B-Instruct using 4 GPUs with the following command:
`torchrun --nproc_per_node 4 -m training.run --output_dir llama3test --model_name_or_path meta-llama/Met…
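For reference, a minimal sketch of what each of the four processes launched by `torchrun --nproc_per_node 4` sees; the script below is only illustrative (it is not `training.run`) and assumes the standard env-var rendezvous that torchrun provides:
```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR/PORT for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # 0..3 with --nproc_per_node 4
    torch.cuda.set_device(local_rank)           # bind this process to one GPU
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} on cuda:{local_rank}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```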
-
![image](https://github.com/user-attachments/assets/be98d5b2-f2aa-41aa-977e-15a7436f2727)
Why did this error appear when I ran the optimize.py file? I simply want to skip distributed training.
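Since optimize.py itself is not shown, here is only a hedged sketch of the usual guard for skipping distributed setup when the script is run as a single plain Python process (the function name is hypothetical):
```python
import os
import torch.distributed as dist

def maybe_init_distributed() -> bool:
    # torchrun and similar launchers set WORLD_SIZE; a plain `python optimize.py`
    # run leaves it unset, so this guard skips distributed init entirely.
    if int(os.environ.get("WORLD_SIZE", "1")) > 1:
        dist.init_process_group(backend="nccl")
        return True
    return False
```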
-
### 🐛 Describe the bug
Hello,
I'm a new user of PyTorch and recently tried to run the Flight Recorder code provided in the tools. But I cannot get the code to execute as expected.
I use ngc 24.10…
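For what it's worth, a minimal sketch of enabling the NCCL flight recorder before the process group is created; the environment variable names are assumed from the PyTorch flight recorder documentation and may differ in the NGC 24.10 container:
```python
import os

# These must be set before torch.distributed.init_process_group creates the
# NCCL process group, otherwise the recorder captures nothing.
os.environ.setdefault("TORCH_NCCL_TRACE_BUFFER_SIZE", "2000")  # keep the last 2000 collectives
os.environ.setdefault("TORCH_NCCL_DUMP_ON_TIMEOUT", "1")       # dump traces on watchdog timeout

import torch.distributed as dist
dist.init_process_group(backend="nccl")
```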
-
```
Executing Cell 19--------------------------------------
INFO:notebook:Training the model...
INFO:training:Using cuda:0 of 1
INFO:training:[config] ckpt_folder -> ./temp_work_dir/./models.
…
```
-
```
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-13-21-05 (UTC+0000), pid 2003528, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/t…
```
-
Your `build_dataloader`:
```python
if phase == 'train':
    if dist:  # distributed training
        batch_size = dataset_opt['batch_size_per_gpu']
        num_workers = dataset_opt['num_worke…
```
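For comparison, a hedged sketch of how a distributed train dataloader is usually built with `DistributedSampler`; the `dataset_opt` key names below are illustrative, not taken from your snippet:
```python
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_train_dataloader(dataset, dataset_opt, dist=False):
    # In distributed mode each rank reads a disjoint shard of the dataset,
    # so batch_size_per_gpu really is the per-process batch size.
    sampler = DistributedSampler(dataset, shuffle=True) if dist else None
    return DataLoader(
        dataset,
        batch_size=dataset_opt['batch_size_per_gpu'],
        shuffle=(sampler is None),
        sampler=sampler,
        num_workers=dataset_opt.get('num_workers', 4),  # illustrative key name
        pin_memory=True,
    )
```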
-
During distributed training, I encountered the following problem when compiling Triton kernels:
```
Traceback (most recent call last):
......
File "/mnt/petrelfs/caoweihan/anaconda3…
-
Hi there,
Just as TensorFlow's Python API has `tf.distribute`, what is the equivalent in the Rust version?
Thanks
-
We have a guide on doing distributed training
w/ Vast here: https://docs.google.com/document/d/1W_dN3qarCOcLRDdEZ75LBtkLGiwUziWWDtVTjd43Ad4/edit?usp=sharing . However, we have not performed full dis…