-
[2024-08-09 17:29:22,420] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-08-09 17:29:22,567] torch.distributed.elastic.multiprocessing.redirects: […
-
Thank you for the great work!
Could you please provide some examples of the functional approach to distributed multi-GPU training?
-
Experimental environment: Two Ubuntu GPU servers
Experimental code source: https://github.com/OvJat/DeepSpeedTutorial.git
Fault description: I used engine.save() to save the model training state …
-
Instead of using our own task pool, we should leverage Dask distributed, as this will allow us to better consume resources from existing clusters.
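A minimal sketch of what replacing a hand-rolled task pool with Dask Distributed could look like. The in-process scheduler, the `square` function, and the input range are illustrative assumptions; on an existing cluster you would point the `Client` at the scheduler address instead.

```python
from dask.distributed import Client

def square(x):
    return x * x

# In-process scheduler for illustration; against a real cluster you would
# use e.g. Client("tcp://scheduler-host:8786") to consume its resources.
client = Client(processes=False)

# submit/map + gather replace the hand-rolled task pool's enqueue/join.
futures = client.map(square, range(4))
results = client.gather(futures)  # [0, 1, 4, 9]
client.close()
```

Because the scheduler handles placement, the same `map`/`gather` code runs unchanged whether it targets local threads or a multi-node cluster.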
-
model = torch.nn.parallel.DistributedDataParallel(model)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled cuda error, NCCL version 2.4.8
cc @pietern @mrsh…
Hznnn updated 3 years ago
-
hello everyone,
![Screenshot from 2024-05-10 20-16-55](https://github.com/TencentARC/GFPGAN/assets/107725595/78b5a5a5-0ea3-4f50-8a0b-97640b851e48)
I'm encountering errors while training a GFPGAN …
-
On `PP + FSDP` and `PP + TP + FSDP`:
- Is there any documentation on how these different parallelisms compose?
- What are the largest training runs these strategies have been tested on?
- Are there…
-
Running the gluestick training code, it only reports that the experiment has started; there is no training process and no result. Is this a training failure, or am I not finding the right way to ob…
-
Currently, FluxMPI has only [1 example](https://github.com/avik-pal/FluxMPI.jl/blob/main/examples/fastai/train.jl). It would be good to showcase training of more image models -- ViT (https://github.co…
-
Distributed training on multiple devices generates this error.
```
dcrnn_gpu.py:16: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please r…
```
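The `YAMLLoadWarning` above comes from PyYAML. Assuming the config file is loaded with a bare `yaml.load(f)`, the idiomatic fix is `yaml.safe_load` (or an explicit `Loader=`); the config keys below are hypothetical stand-ins for whatever `dcrnn_gpu.py` actually reads.

```python
import yaml  # PyYAML

# Stand-in for the training config file's contents.
doc = """
batch_size: 64
learning_rate: 0.001
"""

# yaml.load(doc) with no Loader is deprecated and unsafe; safe_load
# parses only plain data types and does not emit the warning.
cfg = yaml.safe_load(doc)
```

If the config relies on YAML tags that `safe_load` rejects, `yaml.load(doc, Loader=yaml.FullLoader)` is the explicit-Loader alternative.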