Closed: walegahaha123 closed this issue 1 week ago.
```
torchrun --nnodes 1 --node_rank 0 --master_addr localhost --master_port 12345 run.py --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml --MAX_ITEM_LIST_LENGTH 10 --epochs 5 --optim_args.learning_rate 1e-4 --MAX_TEXT_LENGTH 256 --train_batch_size 1 --gradient_checkpointing True --stage 3
```

This config produces the same error:

```
nvmlDeviceGetP2PStatus(0,0,NVML_P2P_CAPS_INDEX_READ) failed: Invalid Argument
```
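This NVML call is how NCCL checks peer-to-peer capability between GPUs, so a first step is usually to inspect the topology NVML reports and to re-run with verbose NCCL logging. A minimal diagnostic sketch, assuming a standard CUDA/NCCL install (`nvidia-smi topo -m`, `NCCL_DEBUG`, and `NCCL_DEBUG_SUBSYS` are standard NVIDIA/NCCL tooling, not part of this repo):

```bash
# Show the GPU/P2P topology that NVML reports on this machine.
nvidia-smi topo -m

# Re-run the same launch with verbose NCCL init logging to see
# exactly where initialization fails (flags as in the command above).
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT torchrun --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 12345 run.py \
    --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml \
    --train_batch_size 1 --stage 3
```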
Hi, I'm sorry, I haven't run into this before. I just tested on a single-GPU machine, and it works fine for me. Could you please check your NCCL environment? Alternatively, you could run the script without DDP, though it may take some time to fix a few issues in the code.
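For reference, two common workarounds for P2P-related NCCL failures, sketched here as suggestions rather than anything confirmed by this repo: disable NCCL's peer-to-peer transport via a standard environment variable, or launch run.py as a plain single process (which, as noted above, may require code changes since the DeepSpeed path expects a distributed launcher):

```bash
# Option 1: keep torchrun but disable NCCL's peer-to-peer transport
# (NCCL_P2P_DISABLE is a standard NCCL environment variable).
NCCL_P2P_DISABLE=1 torchrun --nnodes 1 --node_rank 0 \
    --master_addr localhost --master_port 12345 run.py \
    --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml --train_batch_size 1

# Option 2: bypass DDP entirely with a single plain process
# (may require removing distributed-only code paths in run.py).
python run.py --config_file overall/LLM_deepspeed.yaml HLLM/HLLM.yaml --train_batch_size 1
```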
In code/overall/LLM_deepspeed.yaml, train_batch_size and eval_batch_size are both set to 1, and I still get the NCCL error on a single GPU. Do you know why? Thanks!
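For clarity, the settings being referenced look roughly like this (a hypothetical excerpt: only the two key names come from the thread; the rest of LLM_deepspeed.yaml is not shown here):

```yaml
# Hypothetical excerpt of code/overall/LLM_deepspeed.yaml;
# only these two keys are mentioned in the thread.
train_batch_size: 1
eval_batch_size: 1
```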