CUDA error: invalid device ordinal

zhuiyue233 commented 4 months ago

When I run graphgpt_stage1.sh, I get the following error:

  File "/anaconda3/envs/GGPT/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 125, in __init__
  File "/anaconda3/envs/GGPT/lib/python3.10/site-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/anaconda3/envs/GGPT/lib/python3.10/site-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Could you please help me?

MaoPopovich commented 4 months ago

I ran into the same problem and could not figure out why.

zhuiyue233 commented 4 months ago

> I ran into the same problem and could not figure out why.

I solved it by installing the packages in requirement.txt one by one and then running pip install -U bitsandbytes again.

I don't know why that worked.

zhuiyue233 commented 4 months ago

> I ran into the same problem and could not figure out why.

You could also try changing python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 to python -m torch.distributed.run --nnodes=1 --nproc_per_node=1 --master_port=20001 in the file graphgpt_stage1.sh.
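
For context, here is a minimal sketch of why lowering --nproc_per_node probably helps. It assumes the launcher starts one worker per requested process and that each worker uses its local rank as the CUDA device index, so on a machine that exposes fewer GPUs than --nproc_per_node the extra ranks ask for a device that does not exist. The helper names visible_gpu_count and safe_nproc_per_node are hypothetical and not part of GraphGPT or PyTorch.

```python
import os


def visible_gpu_count() -> int:
    """Return how many CUDA devices PyTorch can see (0 if PyTorch is not installed)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()


def safe_nproc_per_node(requested: int, available: int) -> int:
    """Clamp the requested worker count to the number of GPUs that exist.

    Hypothetical helper for illustration only.
    """
    if requested < 1:
        raise ValueError("requested must be at least 1")
    # Workers get local ranks 0..requested-1. Under the assumption that each
    # rank is used as a device index, any rank >= available has no GPU to bind to.
    # With no GPU at all we still return 1, which only avoids spawning extra workers.
    return max(1, min(requested, available))


if __name__ == "__main__":
    available = visible_gpu_count()
    print(f"visible GPUs: {available}")
    print(f"CUDA_VISIBLE_DEVICES: {os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>')}")
    print(f"suggested --nproc_per_node when the script asks for 4: "
          f"{safe_nproc_per_node(4, available)}")
```

If the first line prints fewer than 4 visible GPUs, that would be consistent with the error above. CUDA_VISIBLE_DEVICES is worth checking too, since it can hide GPUs that are physically present.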

MaoPopovich commented 4 months ago

> I ran into the same problem and could not figure out why.
>
> You could also try changing python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 to python -m torch.distributed.run --nnodes=1 --nproc_per_node=1 --master_port=20001 in the file graphgpt_stage1.sh.

Thanks, this resolved the error.

HKUDS / GraphGPT

CUDA error: invalid device ordinal #54