About distributed training

bytedance / SALMONN

SALMONN: Speech Audio Language Music Open Neural Network

https://bytedance.github.io/SALMONN/

Apache License 2.0

About distributed training #44

Status: Closed (ucasyouzhao1987 closed 3 months ago)

ucasyouzhao1987 commented 3 months ago

How do I run your code with distributed training? I tried setting "use_distributed: True" in your configuration file, but it does not work. As far as I can tell, only single-GPU mode is supported.

Yu-Doit commented 3 months ago

The current version supports both single-node multi-GPU mode and multi-node multi-GPU mode, so just run the train.py script with torchrun. If you encounter any problems, feel free to discuss them here!
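
For readers unfamiliar with torchrun, here is a minimal sketch of what it provides to each process. torchrun starts one process per GPU and sets the environment variables RANK, LOCAL_RANK, and WORLD_SIZE for each of them. The helper name `read_torchrun_env` is hypothetical and is not part of SALMONN; the arguments train.py itself expects are left elided in the launch commands.

```python
import os

# Typical launch commands (train.py's own arguments are elided as "..."):
#   single node, 4 GPUs:
#     torchrun --nproc_per_node=4 train.py ...
#   two nodes, 8 GPUs each (run on every node with its own --node_rank):
#     torchrun --nnodes=2 --node_rank=0 --master_addr=<host> \
#              --master_port=<port> --nproc_per_node=8 train.py ...


def read_torchrun_env(env=None):
    """Hypothetical helper: read the variables torchrun sets per process."""
    env = os.environ if env is None else env
    return {
        # Global index of this process across all nodes.
        "rank": int(env.get("RANK", "0")),
        # Index of this process on its own node; usually picks the GPU.
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        # Total number of processes in the job.
        "world_size": int(env.get("WORLD_SIZE", "1")),
        # Without these variables the script was started with plain `python`.
        "distributed": "RANK" in env and "WORLD_SIZE" in env,
    }


if __name__ == "__main__":
    print(read_torchrun_env())
```

If the script is started with plain `python train.py`, none of these variables are set, which may explain why only one GPU is used even with "use_distributed: True".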

SoshyHayami commented 3 months ago

@Yu-Doit How about inference on multiple GPUs?

Yu-Doit commented 3 months ago

It's the same as training: launch the same script with torchrun. The only difference is that you should set run.evaluate: True in your config, which skips training and runs inference directly.
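
As an illustration only, the dotted name run.evaluate refers to an `evaluate` key nested under a `run` section of the config. The sketch below shows how a script could branch on such a flag. It assumes the config has already been loaded into a nested dict; `get_by_path` and `select_mode` are hypothetical names, not SALMONN's actual code.

```python
def get_by_path(cfg, dotted_key, default=None):
    """Look up a dotted key such as "run.evaluate" in a nested dict."""
    node = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


def select_mode(cfg):
    """Return "evaluate" when run.evaluate is truthy, otherwise "train"."""
    return "evaluate" if get_by_path(cfg, "run.evaluate", False) else "train"


if __name__ == "__main__":
    # Equivalent to a config file containing a `run:` section
    # with `evaluate: True` under it.
    cfg = {"run": {"evaluate": True}}
    print(select_mode(cfg))
```

The launch command stays the same in both cases; only the config value changes.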