Meituan-AutoML / MobileVLM

Strong and Open Vision Language Assistant for Mobile Devices
Apache License 2.0

How to train on multiple nodes with DeepSpeed (taking "pre_train" as an example) #60

Closed: thunder95 closed this issue 1 week ago

thunder95 commented 3 weeks ago

I tried this on two machines with DeepSpeed and got the following error:

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:9901 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

Can anyone give me an example? Thank you!
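For anyone debugging the same message: errno 98 means some process on that machine already holds the port the rendezvous server tried to bind (9901 in the error above). Two causes worth ruling out are a leftover process from an earlier run and more than one process on the same host trying to act as the server. The sketch below is only a diagnostic and is not part of MobileVLM or DeepSpeed; `port_is_free` is a hypothetical helper name, and it only checks IPv4 on the machine where it runs, so it would need to be run on the node that is supposed to host the master port.

```python
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper for diagnosis only. An empty host means
    "all local IPv4 interfaces".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically errno 98 (EADDRINUSE) on Linux when the port is taken.
            return False
        return True


if __name__ == "__main__":
    # 9901 is the port from the error message; substitute your own master port.
    print("port 9901 free:", port_is_free(9901))
```

If this reports the port as taken before training starts, freeing it or choosing a different master port for the launch is the first thing to try.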