Closed: walegahaha123 closed this issue 1 week ago.
```
torchrun --nnodes 1 --node_rank 0 --master_addr localhost --master_port 12345 run.py --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml --MAX_ITEM_LIST_LENGTH 10 --epochs 5 --optim_args.learning_rate 1e-4 --MAX_TEXT_LENGTH 256 --train_batch_size 1 --gradient_checkpointing True --stage 3
```

This config produces the same error:

```
nvmlDeviceGetP2PStatus(0,0,NVML_P2P_CAPS_INDEX_READ) failed: Invalid Argument
```
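This NVML call is how NCCL checks peer-to-peer capability between GPUs, so a first step is usually to inspect the topology NVML reports and to re-run with verbose NCCL logging. A minimal diagnostic sketch, assuming a standard CUDA/NCCL install (`nvidia-smi topo -m`, `NCCL_DEBUG`, and `NCCL_DEBUG_SUBSYS` are standard NVIDIA/NCCL tooling, not part of this repo):

```bash
# Show the GPU/P2P topology that NVML reports on this machine.
nvidia-smi topo -m

# Re-run the same launch with verbose NCCL init logging to see
# exactly where initialization fails (flags as in the command above).
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT torchrun --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 12345 run.py \
    --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml \
    --train_batch_size 1 --stage 3
```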
Hi, I'm sorry, I haven't run into this before. I just tested on a single-GPU machine, and it works fine for me. Could you please check your NCCL environment? Alternatively, you could run the script without DDP, though it may take some time to fix a few issues in the code.
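For reference, two common workarounds for P2P-related NCCL failures, sketched here as suggestions rather than anything confirmed by this repo: disable NCCL's peer-to-peer transport via a standard environment variable, or launch run.py as a plain single process (which, as noted above, may require code changes since the DeepSpeed path expects a distributed launcher):

```bash
# Option 1: keep torchrun but disable NCCL's peer-to-peer transport
# (NCCL_P2P_DISABLE is a standard NCCL environment variable).
NCCL_P2P_DISABLE=1 torchrun --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 12345 run.py \
    --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml --train_batch_size 1

# Option 2: bypass DDP entirely with a single plain process
# (may require removing distributed-only code paths in run.py).
python run.py --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml --train_batch_size 1
```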
In code/overall/LLM_deepspeed.yaml, train_batch_size and eval_batch_size are both set to 1, and I still get the NCCL error on a single GPU. Do you know why? Thanks!
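For clarity, the settings being referenced look roughly like this (a hypothetical excerpt: only the two key names come from the thread; the rest of LLM_deepspeed.yaml is not shown here):

```yaml
# Hypothetical excerpt of code/overall/LLM_deepspeed.yaml;
# only these two keys are mentioned in the thread.
train_batch_size: 1
eval_batch_size: 1
```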