XiaoxinHe / G-Retriever

Official Implementation of NeurIPS 2024 paper "G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering"
https://arxiv.org/abs/2402.07630
MIT License
330 stars, 58 forks

CUDA error: invalid device ordinal (Finetuning with only one GPU) #30

Open quqxui opened 1 day ago

quqxui commented 1 day ago

Hi Xiaoxin,

I want to express my appreciation for the incredible work you and your team are doing.

However, I encountered a problem: when I attempt to run training on a single GPU, I get the following error:

Traceback (most recent call last):
  File "/home/ dr/research/G-Retriever/train.py", line 146, in <module>
    main(args)
  File "/home/ dr/research/G-Retriever/train.py", line 48, in main
    model = load_model[args.model_name](graph_type=dataset.graph_type, args=args, init_prompt=dataset.prompt)
  File "/home/ dr/research/G-Retriever/src/model/graph_llm.py", line 43, in __init__
    model = AutoModelForCausalLM.from_pretrained(
  File "/opt/anaconda3/envs/ dr-g_ret/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 564, in from_pretrained
    return model_class.from_pretrained(
  File "/opt/anaconda3/envs/ dr-g_ret/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4225, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/anaconda3/envs/ dr-g_ret/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4728, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/anaconda3/envs/ dr-g_ret/lib/python3.9/site-packages/transformers/modeling_utils.py", line 993, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/anaconda3/envs/ dr-g_ret/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 329, in set_module_tensor_to_device
    new_value = value.to(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When I use two GPUs, the process runs without any issues. However, due to limited resources, using two GPUs is not feasible for me.

Could you please advise on how I might be able to successfully run the fine-tuning on just one GPU?

To reproduce the bug, run:

CUDA_VISIBLE_DEVICES=0 python train.py --dataset webqsp --model_name graph_llm --llm_frozen False

giuseppefutia commented 1 day ago

I think you could follow the suggestion reported here: https://github.com/XiaoxinHe/G-Retriever/issues/29.

giuseppefutia commented 1 day ago

More generally, you could try the colab branch of my fork, where I tried to address some of the issues with running on a single GPU: https://github.com/giuseppefutia/G-Retriever/tree/colab.

I tested with the following command and it should work:

!CUDA_LAUNCH_BLOCKING=1 python train.py --dataset webqsp --model_name graph_llm --llm_frozen False --batch_size 1 --eval_batch_size 2

otaviocx commented 23 hours ago

Consider using the --max-memory parameter we introduced in this PR: https://github.com/XiaoxinHe/G-Retriever/pull/25
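
For context, with CUDA_VISIBLE_DEVICES=0 the process sees exactly one GPU, at ordinal 0, so any placement that refers to device 1 raises "invalid device ordinal". A minimal sketch of one way to avoid that, assuming the model is loaded with a per-device memory map (an assumption; the thread does not show the loading code): build the map from the number of GPUs that are actually visible. The helper name build_max_memory and the "20GiB" default are hypothetical and not taken from the repository.

```python
def build_max_memory(num_visible_gpus, per_gpu_limit="20GiB"):
    """Return a {device_index: memory_limit} map with one entry per visible GPU.

    Hypothetical helper: the name and the default limit are illustrative only.
    """
    if num_visible_gpus < 1:
        raise ValueError("at least one visible GPU is required")
    # Device indices are renumbered from 0 inside the process, so with
    # CUDA_VISIBLE_DEVICES=0 the only valid key is 0.
    return {device_index: per_gpu_limit for device_index in range(num_visible_gpus)}


# Intended use (assumes torch and transformers are installed; not executed here):
#   import torch
#   from transformers import AutoModelForCausalLM
#   max_memory = build_max_memory(torch.cuda.device_count())
#   model = AutoModelForCausalLM.from_pretrained(
#       model_path, device_map="auto", max_memory=max_memory
#   )
```

On a single GPU this yields a map with only key 0, so nothing is placed on a device the process cannot see. Whether the model then fits in memory is a separate question, which the smaller batch sizes suggested above are meant to help with.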