-
### 🐛 Describe the bug
torchrun multi-machine, multi-card training error.
Both Rank1 and Rank2 can train normally on their own.
The error occurs after the NCCL communication is successfully established…
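For context, a minimal sketch of the kind of setup being described, assuming a launch via torchrun's env:// rendezvous; the two-node launch command, the endpoint address, and the `all_reduce` probe below are illustrative assumptions, not from the report:
```python
# Hedged sketch, not the reporter's script. Assumes a launch like:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=<master-ip>:29500 train.py
# where <master-ip> is a placeholder.
import os
import torch
import torch.distributed as dist

def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE, so the default
    # env:// rendezvous is sufficient here.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A single all_reduce exercises the inter-node NCCL link; errors that
    # appear right after a successful rendezvous usually surface here.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```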
-
Hello! I am training the first two knowledge distillation stages of Mamba 2 on one DGX-H100x8 node, and I am seeing training times of ~8 hours for the first stage and ~13 hours for the second stag…
-
I am experimenting with the tutorial below:
- https://github.com/FedML-AI/FedML/blob/master/fedml_experiments/distributed/fedavg/README.md
Run the following shell script and dump the arguments it receives:
```b…
-
**Is your feature request related to a problem? Please describe.**
I calculated correlation coefficients on datasets of 90-180 GB using xarray and Dask distributed, and experie…
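For reference, a minimal sketch of this kind of workflow; the file pattern, the variable names `a` and `b`, the `time` dimension, and the chunk size are all placeholder assumptions, not from the report:
```python
# Hedged sketch of a correlation computation over dask-backed data.
import xarray as xr
from dask.distributed import Client

client = Client()  # local dask-distributed cluster for testing

# chunks=... makes the variables dask-backed, so nothing loads eagerly.
ds = xr.open_mfdataset("data/*.nc", chunks={"time": 100})

# xr.corr builds a lazy Pearson-correlation graph; compute() runs it on
# the distributed cluster, which is where memory pressure would show up.
corr = xr.corr(ds["a"], ds["b"], dim="time").compute()
```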
-
**Describe the bug**
I am trying to fine-tune the `llama3-8B` model on multiple nodes, but I get an AttributeError after the mcore-format checkpoint finishes loading and dataset building starts; the error is below:
…
-
Hey,
I have seen the previous issues. Based on those, I tracked down the approximate line where the pipeline gets stuck: the setup function, where it fails to load the model. The training is n…
-
Recently I used `open_mfdataset` to open a local tar.gz archive of multiple netCDF files;
it failed to open and raised a `distributed.scheduler.KilledWorker` error and a
`TypeError: cannot serialize…
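One hedged workaround sketch (the archive and member paths below are placeholders): netCDF engines need seekable files on disk, and open file handles generally cannot be pickled to Dask workers, so extracting the archive first and pointing `open_mfdataset` at the extracted files sidesteps both problems.
```python
# Hedged workaround sketch; paths are placeholders.
import glob
import tarfile
import xarray as xr

# Extract once; netCDF readers need real, seekable files on disk.
with tarfile.open("files.tar.gz", "r:gz") as tar:
    tar.extractall("extracted")

# Open the extracted members; plain paths serialize cleanly to workers.
ds = xr.open_mfdataset(sorted(glob.glob("extracted/*.nc")),
                       combine="by_coords")
```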
-
(Raised a similar [issue](https://github.com/NVIDIA/Megatron-LM/issues/450) in the Megatron repo, but I think it might be more appropriate here, so I'm adding more details)
I am trying to run Mega…
-
### Describe the bug
I run the training but get the following error:
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: '…
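# (Hedged note, not part of the reporter's config: `accelerate config`
# writes a file like the one above to its default location, and training
# is then typically launched with `accelerate launch train.py`, where
# train.py is a placeholder for the actual script.)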
-
#### Describe the functionality you would like to see.
I would like to add information to the documentation about which file loaders support the `dask-distributed` backend. Mostly just add an extra …
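As a sketch of what such a docs entry could demonstrate (the file name, variable layout, and chunk size below are illustrative assumptions): a loader "supports" the backend when the objects it returns can be scheduled through a `dask.distributed` client.
```python
# Hedged illustration; sample.nc and the chunk size are placeholders.
from dask.distributed import Client
import xarray as xr

client = Client()  # connect to, or start, a distributed scheduler

# With chunks=..., open_dataset returns dask-backed variables whose
# tasks the distributed client can execute across workers.
ds = xr.open_dataset("sample.nc", chunks={"time": 100})
result = ds.mean().compute()  # runs on the cluster, not in-process
```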