LuxDL / Lux.jl

Elegant and Performant Scientific Machine Learning in Julia
https://lux.csail.mit.edu/
MIT License

Distributed training issue #983

Open · jakubMitura14 opened this issue 1 month ago

jakubMitura14 commented 1 month ago

Hello, I have 2 GPUs, as shown by the attached nvidia-smi output.

Then I try

using Lux, MPI, NCCL  # the NCCL backend needs MPI.jl and NCCL.jl loaded alongside Lux

DistributedUtils.initialize(NCCLBackend)
distributed_backend = DistributedUtils.get_distributed_backend(NCCLBackend)
DistributedUtils.local_rank(distributed_backend)      # returns 0
DistributedUtils.total_workers(distributed_backend)   # returns 1

Here local_rank evaluates to 0 and total_workers evaluates to 1, which seems incorrect if I understand the idea correctly.

I am using Lux v1.1.0, CUDA v5.5.2, and Julia 1.10.

avik-pal commented 1 month ago

You need to start Julia with mpiexec; see https://github.com/LuxDL/Lux.jl/tree/main/examples/ImageNet#distributed-data-parallel-training
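
For context, a minimal sketch of what an MPI-launched run looks like, assuming two GPUs; the script name train.jl and the process count passed to -n are illustrative, not part of the Lux API:

# launch one Julia process per GPU, e.g.:
#   mpiexec -n 2 julia --project train.jl

# train.jl
using Lux, MPI, NCCL

DistributedUtils.initialize(NCCLBackend)
backend = DistributedUtils.get_distributed_backend(NCCLBackend)
println("rank $(DistributedUtils.local_rank(backend)) of $(DistributedUtils.total_workers(backend)) workers")

Launched this way, each rank runs as a separate process, so total_workers should report 2, with local_rank 0 on one process and 1 on the other. Starting Julia directly (without mpiexec) spawns only a single process, which is why total_workers comes back as 1.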

jakubMitura14 commented 1 month ago

For now it still does not work, but I need to dig deeper into MPI first. Thanks for the guidance!