OFA-Sys / ONE-PEACE

A general representation model across vision, audio, language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Apache License 2.0
964 stars, 63 forks

Unable to train the algorithms with 2 GPU instances (multi-node), each with 4 A100s #47

Closed. CheruscanArminius closed this issue 10 months ago

CheruscanArminius commented 10 months ago

The paper says the model was trained with 8 A100 GPUs. I have two instances, each equipped with 4 A100s, instead of one GPU instance with 8 A100s. Is there any way to specify the instances in the configuration?

In other words, where can I specify the number of nodes in the code? For reference, these describe the kind of setting I am looking for: https://lightning.ai/docs/pytorch/stable/common/trainer.html#num-nodes, https://pytorch.org/docs/stable/elastic/run.html, and https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide

I would appreciate it if you could comment on this.

CheruscanArminius commented 10 months ago

Found it here: https://github.com/OFA-Sys/ONE-PEACE/blob/4de62d73034196ab7ea33daeb4a9010b4a9cdf03/fairseq/docs/getting_started.rst#L171
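
For anyone landing here with the same 2 x 4 GPU setup: the linked fairseq docs cover launching one process group per machine. Below is a rough sketch of what that could look like with `torchrun`, not the maintainers' official recipe. The `torchrun` flags are the standard ones; the master address `10.0.0.1`, the port `29500`, and the `train.py` entry point are placeholders I made up, so substitute your own head-node IP and the actual ONE-PEACE training command and arguments.

```python
import shlex


def build_launch_command(node_rank, nnodes=2, nproc_per_node=4,
                         master_addr="10.0.0.1", master_port=29500,
                         train_cmd=("train.py",)):
    """Return the torchrun argv to run on the node with the given rank.

    master_addr, master_port and train_cmd are placeholders (assumptions),
    not values taken from the ONE-PEACE repository.
    """
    if not 0 <= node_rank < nnodes:
        raise ValueError(f"node_rank must be in [0, {nnodes}), got {node_rank}")
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # total number of machines
        f"--nproc_per_node={nproc_per_node}",  # one process per GPU on this machine
        f"--node_rank={node_rank}",            # 0 on the head node, 1 on the other
        f"--master_addr={master_addr}",        # address of the rank-0 node
        f"--master_port={master_port}",        # free port on the rank-0 node
        *train_cmd,
    ]


def global_ranks(node_rank, nproc_per_node=4):
    """Global ranks of the processes started on one node."""
    return [node_rank * nproc_per_node + local for local in range(nproc_per_node)]


if __name__ == "__main__":
    # Print the command to run on each of the two machines.
    for rank in range(2):
        print(f"# run on node {rank} (global ranks {global_ranks(rank)})")
        print(shlex.join(build_launch_command(rank)))
```

The same command is run on both machines, with only `--node_rank` differing. The world size is then 2 x 4 = 8 processes, which matches the 8-GPU count mentioned in the paper.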