HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0
493 stars, 36 forks

CUDA error: invalid device ordinal #54

Status: Closed (zhuiyue233 closed this issue 4 months ago)

zhuiyue233 commented 4 months ago

When I run graphgpt_stage1.sh, I get the following error:

  File "/anaconda3/envs/GGPT/lib/python3.10/site-packages/transformers/hf_argparser.py", line 338, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 125, in __init__
  File "/anaconda3/envs/GGPT/lib/python3.10/site-packages/transformers/training_args.py", line 1372, in __post_init__
    and (self.device.type != "cuda")
  File "/anaconda3/envs/GGPT/lib/python3.10/site-packages/transformers/training_args.py", line 1795, in device
    return self._setup_devices

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Could you please help me with this?
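A common cause of "invalid device ordinal" is a process asking for a GPU index that does not exist on the machine, for example when the launcher starts more processes per node than there are visible GPUs. The sketch below is one way to check this. It assumes PyTorch is installed in the environment, and `visible_gpu_count` and `invalid_ordinals` are hypothetical helper names, not part of GraphGPT or transformers.

```python
from typing import List, Optional


def visible_gpu_count() -> Optional[int]:
    """Return how many CUDA devices PyTorch can see, or None if PyTorch is missing."""
    try:
        import torch  # imported lazily so the helper below also works without PyTorch
    except ImportError:
        return None
    return torch.cuda.device_count()


def invalid_ordinals(nproc_per_node: int, device_count: int) -> List[int]:
    """Hypothetical helper: local ranks that would have no matching GPU index."""
    return [rank for rank in range(nproc_per_node) if rank >= device_count]


if __name__ == "__main__":
    count = visible_gpu_count()
    if count is None:
        print("PyTorch is not installed in this environment.")
    else:
        # 4 is the --nproc_per_node value used in graphgpt_stage1.sh
        print("visible GPUs:", count)
        print("ranks without a GPU:", invalid_ordinals(4, count))
```

If the second list is not empty, the launcher is requesting more processes than there are GPUs. Note that `CUDA_VISIBLE_DEVICES` can also reduce the number of devices a process sees.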

MaoPopovich commented 4 months ago

I ran into the same problem and could not figure out why.

zhuiyue233 commented 4 months ago

I solved it. I installed the packages in requirements.txt one by one, and then ran pip install -U bitsandbytes again.

I don't know why that fixed it.

zhuiyue233 commented 4 months ago

You can also try changing the launcher line in graphgpt_stage1.sh from

python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001

to

python -m torch.distributed.run --nnodes=1 --nproc_per_node=1 --master_port=20001
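For anyone who wants to apply this change programmatically, here is a minimal sketch that rewrites the `--nproc_per_node` flag in a launcher command string. `set_nproc_per_node` is a hypothetical helper written for illustration; it only edits text and does not check how many GPUs are actually present.

```python
import re


def set_nproc_per_node(command: str, nproc: int) -> str:
    """Hypothetical helper: replace the --nproc_per_node value in a launcher command."""
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    new_command, replaced = re.subn(
        r"--nproc_per_node=\d+", f"--nproc_per_node={nproc}", command
    )
    if replaced == 0:
        raise ValueError("--nproc_per_node flag not found in command")
    return new_command


# The launcher line quoted in this thread
launch = (
    "python -m torch.distributed.run --nnodes=1 "
    "--nproc_per_node=4 --master_port=20001"
)
print(set_nproc_per_node(launch, 1))
```

Setting the value to the number of GPUs on the machine, rather than always to 1, should keep multi-GPU training available where the hardware allows it.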

MaoPopovich commented 4 months ago

Thanks, this error has been resolved.