facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Launching train/train.py directly without Slurm #161

Open vladchimescu opened 1 year ago

vladchimescu commented 1 year ago

Hi, I am trying to launch the dinov2/train/train.py script directly without the Slurm scheduler. I use the following command to launch the training:

export CUDA_VISIBLE_DEVICES=0,1 && python dinov2/train/train.py --config_file myconfig.yaml --output-dir my_outputdir

However, I can't seem to get it to work for training on multiple GPUs. I also tried using torchrun but haven't found the right argument combination.

I'm looking for a minimal example of launching train/train.py directly (without the use of run/train.py) with multi-GPU FSDP training enabled.

usryokousha commented 1 year ago

Just use:

export CUDA_VISIBLE_DEVICES=0,1
export PYTHONPATH=absolute/workspace/directory
python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir --use-env 
vladchimescu commented 1 year ago

@usryokousha Thanks! I managed to get it running without the --use-env flag:

export CUDA_VISIBLE_DEVICES=0,1 && python -m torch.distributed.launch --nproc_per_node=2 dinov2/train/train.py --config-file=myconfig.yaml --output-dir=my_outputdir 

In fact, passing --use-env resulted in an error, as it is an unrecognised argument to the script. I guess one could add it to the argparser.

By the way, I had to add the following in dinov2/train/train.py:

parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.") 

Multi-GPU training definitely works, but oddly it shows current_batch_size: 128.0000, which is my batch size per GPU. I would have expected it to show 256 (= 128 * 2 GPUs)?

qasfb commented 1 year ago

It's just a logging issue: it displays the batch size per GPU; maybe we can use a better name.
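For reference, the effective global batch size is just the per-GPU value multiplied by the world size. A minimal sketch, assuming one training process per GPU (the helper name here is made up and not part of the repo):

```python
# Hypothetical helper (not part of the repo): recover the effective global
# batch size from the per-GPU value that the logger prints.
import torch.distributed as dist


def global_batch_size(per_gpu_batch_size: int) -> int:
    # One process per GPU, so the world size equals the number of GPUs in use.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    return per_gpu_batch_size * world_size


# With batch_size_per_gpu = 128 and 2 GPUs this gives 256.
```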

BenSpex commented 1 year ago

@patricklabatut and @usryokousha Any reason to use python -m torch.distributed.launch over torchrun? At least according to the PyTorch documentation, torchrun offers more fault tolerance, etc.

GravityZL commented 1 year ago

Look into the code in dinov2/distributed/__init__.py: simply change

self.local_rank = int(os.environ["LOCAL_RANK"])

in the method _set_from_azure_env(self) to

self.local_world_size = torch.cuda.device_count()

and then it works.

Or as vladchimescu mentioned, add one argument:

parser.add_argument("--local-rank", default=0, type=int, help="Variable for distributed computing.")

and start your training with:

export CUDA_VISIBLE_DEVICES=xx,xx

Hope it helps :)
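In case it helps, here is a rough sketch of the kind of fallback described above. It is illustrative only; the real logic lives in dinov2/distributed/__init__.py and is organised differently:

```python
# Illustrative sketch only, not the repository's code.
import os

import torch


def resolve_local_rank_and_world_size():
    # torchrun exports LOCAL_RANK and LOCAL_WORLD_SIZE for each process;
    # fall back to the visible device count when they are not set.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", torch.cuda.device_count()))
    return local_rank, local_world_size
```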

GravityZL commented 1 year ago

@patricklabatut and @usryokousha Any reason to use python -m torch.distributed.launch over torchrun? At least according to the PyTorch documentation, torchrun offers more fault tolerance, etc.

Both work, but python -m torch.distributed.launch is deprecated and will eventually be removed.

Shizhen-ZHAO commented 1 year ago

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

GZ-YourZY commented 11 months ago

Could you share a few more details of the changes needed for multi-GPU training? After trying both approaches, my code still trains on only a single GPU.

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

GravityZL commented 11 months ago

Could you share a few more details of the changes needed for multi-GPU training? After trying both approaches, my code still trains on only a single GPU.

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

Maybe you need to set the sampler type from INFINITE to DISTRIBUTED
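A rough sketch of what that change could look like, using the SamplerType enum and make_data_loader helper from dinov2.data (the exact keyword arguments used in train.py may differ):

```python
# Sketch only: building the training loader with a distributed sampler
# instead of an infinite one.
from dinov2.data import SamplerType, make_data_loader


def build_distributed_loader(dataset, batch_size_per_gpu: int, num_workers: int):
    return make_data_loader(
        dataset=dataset,
        batch_size=batch_size_per_gpu,
        num_workers=num_workers,
        shuffle=True,
        seed=0,
        sampler_type=SamplerType.DISTRIBUTED,  # instead of an infinite sampler
    )
```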

GravityZL commented 11 months ago

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

I had the same issue for multi-node training, but not for multi-GPU within one node.

qasfb commented 11 months ago

It depends on your inter-node connectivity.

GravityZL commented 11 months ago

It depends on your inter-node connectivity.

I have InfiniBand for the inter-node connection, but I checked the whole training process and InfiniBand is not really used. If I don't have Slurm on the cluster, how can I enable distributed training at the same (or at least comparable) speed? Thank you

qasfb commented 11 months ago

If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the performance it should? https://github.com/NVIDIA/nccl-tests

I think maybe you could copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster.
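As a lighter-weight complement to nccl-tests, a minimal PyTorch-only sanity check (independent of dinov2) can confirm that NCCL collectives work across ranks. The file name below is arbitrary; this is only a sketch:

```python
# Minimal NCCL sanity check. Save as e.g. nccl_check.py and launch with:
#   torchrun --nproc_per_node=2 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")  # reads RANK / WORLD_SIZE set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)  # every rank should end up with world_size
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```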

usryokousha commented 11 months ago

It seems strange to me that it would be REALLY slow. It may not be a bad idea to use DISTRIBUTED instead of INFINITE, due to some slowdown per process with INFINITE. I wouldn't expect a major difference though. For a single node you could just launch through SLURM, however.

Hi, I tried to launch train/train.py directly without Slurm using your methods, but I found that training with multiple GPUs is much slower than with just one GPU. It's so weird. Did you encounter the same problem?

I had the same issue for multi-node training, but not for multi-GPU within one node.


GravityZL commented 11 months ago

If InfiniBand is not used, maybe there is a problem with the cluster configuration? Are you able to run nccl-tests, and does it give the performance it should? https://github.com/NVIDIA/nccl-tests

I think maybe you could copy-paste the PyTorch distributed initialization functions from a setup that you are sure works on your cluster.

Thank you! I have solved the issue; it was indeed that the cluster was not properly set up.

adipill04 commented 5 months ago

I was able to run train.py directly without SLURM thanks to this thread. However, I am now faced with the challenge of using 2 GPUs to train the model on my dataset. When running the training, my monitoring shows that the 2nd GPU is barely used, if at all. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks

ahmed1996said commented 5 months ago

I was able to run train.py directly without SLURM thanks to this thread. However, I am now faced with the challenge of using 2 GPUs to train the model on my dataset. When running the training, my monitoring shows that the 2nd GPU is barely used, if at all. My question: is there any other change I need to make to the training script to ensure it uses more than 1 GPU? Thanks

Same issue here @adipill04. I'm using torchrun:

torchrun --nproc_per_node=2 dinov2/train/train.py --config-file=<PATH_TO_YAML> --output-dir=<PATH_TO_OUTPUT>

It seems to only train on a single GPU. Did you find a solution for this?

TumVink commented 1 month ago

Heyy,

I am also facing the problem that training with 8 GPUs is slower than, or equal to, training with 2 GPUs. I tried both the INFINITE and DISTRIBUTED samplers, but neither helps.

Any ideas on how to solve this?

adipill04 commented 1 month ago

Heyy,

I am also facing the problem that training with 8 GPUs is slower than, or equal to, training with 2 GPUs. I tried both the INFINITE and DISTRIBUTED samplers, but neither helps.

Any ideas on how to solve this?

Are you using a cluster scheduler like SLURM? Are you also setting the environment variable NCCL_P2P_DISABLE? I ran into some similar issues, with multi-GPU training not speeding up the training process, and had to disable P2P communication for it to even start. The underlying issue in my case was a wrong hardware configuration.

I'd suggest using Weights & Biases (wandb.ai) to track model training metrics if you don't already. It can be useful for understanding what your bottleneck is.
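For reference, NCCL reads NCCL_P2P_DISABLE when its communicators are created, so it has to be set before process-group initialization, either as export NCCL_P2P_DISABLE=1 in the shell or at the top of the script. A minimal sketch of the in-script variant:

```python
import os

import torch.distributed as dist

# Workaround sketch: disable NCCL peer-to-peer transfers. The variable must be
# set before the process group / first NCCL communicator is created.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
dist.init_process_group(backend="nccl")
```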

TumVink commented 1 month ago

Heyy, thanks for your reply @adipill04.

No, I am not using a Slurm cluster, but rather DDP launched with torchrun. Within the node, the GPUs communicate with each other via NVLink. I tuned num_workers and the speed now seems more normal.

adipill04 commented 1 month ago

Heyy, thanks for your reply @adipill04.

No, I am not using a Slurm cluster, but rather DDP launched with torchrun. Within the node, the GPUs communicate with each other via NVLink. I tuned num_workers and the speed now seems more normal.

Great! To add on: for num_workers, I found that setting it to 4 * the number of GPUs yielded the fastest speeds on the dataloader side of things.
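A literal sketch of that heuristic; whether it should be applied per process or as a total budget depends on your dataloading setup, since with one process per GPU each rank runs its own DataLoader:

```python
# Heuristic from the comment above: num_workers = 4 * number of visible GPUs.
# With one DataLoader per rank this is applied per process, so scale it down
# if CPU cores become the limiting factor.
import torch

num_workers = 4 * torch.cuda.device_count()
print(f"suggested num_workers: {num_workers}")
```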

TumVink commented 1 month ago

Yes, that makes sense when you have an efficient data loader pipeline!

sipie800 commented 1 month ago

+1