Open shashwat14 opened 3 weeks ago
Hi, please refer to https://github.com/NVIDIA/nccl/issues/833 to see if it helps
I'm afraid that's not the same error. Their issue is network-related, whereas mine reports that the arch of node cpu could not be found. I suspect this is because the LLaVA model is initially placed on the CPU and only later moved to CUDA.
BTW, I am able to run this on an A100 machine using only GPUs 4,5,6,7, so it may be an H100-specific issue.
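For reference, here is a minimal sketch of how a run is typically restricted to a GPU subset such as 4,5,6,7. It assumes the training scripts are started through a launcher that honors either the `CUDA_VISIBLE_DEVICES` environment variable or DeepSpeed's `--include` option; the helper names `gpu_subset_env` and `deepspeed_include_arg` are hypothetical and not part of this repository. It only illustrates the GPU selection and is not a confirmed fix for the error.

```python
import os


def gpu_subset_env(gpu_ids, base_env=None):
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    Hypothetical helper: the returned dict can be passed as `env=` to
    subprocess.run so that only the listed GPUs are visible to the child.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def deepspeed_include_arg(gpu_ids, host="localhost"):
    """Build the value for DeepSpeed's --include option, e.g. localhost:4,5,6,7.

    Assumption: the job runs on a single node named `host`.
    """
    return f"{host}:" + ",".join(str(i) for i in gpu_ids)


if __name__ == "__main__":
    gpus = [4, 5, 6, 7]
    env = gpu_subset_env(gpus)
    print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
    print("--include", deepspeed_include_arg(gpus))
    # Example launch (not executed here; the script path is an assumption):
    # subprocess.run(["bash", "pretrain.sh"], env=env, check=True)
```

Note that with `CUDA_VISIBLE_DEVICES=4,5,6,7` the process sees the selected devices renumbered as 0 to 3, so any device indices inside the training code refer to that renumbered range.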
First of all - thanks for all the great work!
My setup is an H100 machine. I am trying to use only GPUs 4,5,6,7, but I get the following error. I can run successfully with all eight GPUs (0,1,2,3,4,5,6,7); however, I only want to use half of them. I'm not sure how to resolve this.
During pretrain.sh or finetune.sh, I get the following error.