TinyLLaVA / TinyLLaVA_Factory

A Framework of Small-scale Large Multimodal Models
https://arxiv.org/abs/2402.14289
Apache License 2.0

Unable to use GPUs 4,5,6,7 for training #112

Open shashwat14 opened 3 weeks ago

shashwat14 commented 3 weeks ago

First of all - thanks for all the great work!

My setup is an H100 node. I am trying to train on GPUs 4,5,6,7 only, but I get the following error. Training runs successfully when I use all eight GPUs (0,1,2,3,4,5,6,7); however, I only want to use half of them. I'm not sure how to resolve this.

When running pretrain.sh or finetune.sh, I get the following error:

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
:1331, internal error - please report this issue to the NCCL developers, NCCL version 2.18.1
ncclInternalError: Internal check failed.
Last error:
Attribute arch of node cpu not found
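For reference, a minimal sketch of one common way to pin a run to a GPU subset: restrict device visibility at the process level via CUDA_VISIBLE_DEVICES, so NCCL only enumerates the intended GPUs. The launch line is a hypothetical path, not necessarily this repo's actual script location.

```shell
# Expose only GPUs 4-7 to the training processes; inside the job they
# will be renumbered as cuda:0..cuda:3.
export CUDA_VISIBLE_DEVICES=4,5,6,7

# bash scripts/pretrain.sh   # hypothetical path; launch as usual
```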
YingHuTsing commented 3 weeks ago

Hi, please refer to https://github.com/NVIDIA/nccl/issues/833 to see if it helps.
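To gather more detail before filing upstream, a sketch of NCCL's standard debug settings, which print the initialization and topology-detection steps where this error is raised:

```shell
# NCCL debug logging: INIT covers process-group setup, GRAPH covers the
# topology graph search ("Attribute arch of node cpu" comes from topology code).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,GRAPH
```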

shashwat14 commented 3 weeks ago

I'm afraid that's not the same error. Their issue has to do with networking, while mine is `Attribute arch of node cpu not found`. I suspect this happens because the llava model is initially placed on the CPU and only later moved to CUDA.

shashwat14 commented 3 weeks ago

BTW, I am able to run this on an A100 with GPUs 4,5,6,7 only, so maybe it's an H100-specific issue.