XiaoxinHe / G-Retriever

Repository for G-Retriever
https://arxiv.org/abs/2402.07630
MIT License
315 stars · 54 forks

Same issue as "RuntimeError: selected index k out of range #10" #13

Closed yucheny5 closed 4 months ago

yucheny5 commented 4 months ago

Dear authors,

I am encountering exactly the same issue as "RuntimeError: selected index k out of range #10". Could the authors double-check this error by running the preprocessing code from scratch instead of using the already-processed dataset?

Thanks~

XiaoxinHe commented 4 months ago

Hi,

In response to #10, I tried to reproduce the error by running the preprocessing from scratch with python -m src.dataset.preprocess.webqsp and python -m src.dataset.webqsp. However, I did not encounter the same issue.

This issue is triggered by a graph where graph.x.size(0) == 0 but graph.num_nodes != 0, which was likely caused by an OOM error during preprocessing. Could you please check whether your program enters the except branch at this line during preprocessing? If so, please decrease the batch size until it fits into your memory and rerun the preprocessing. Thank you!
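For anyone hitting the same symptom, the inconsistency described above can be scanned for ahead of training. This is a minimal sketch with a hypothetical helper name; with the real PyG dataset you would check graph.x.size(0) and graph.num_nodes on each torch_geometric Data object instead of the plain dicts used here:

```python
def find_empty_feature_graphs(dataset):
    """Return indices of graphs with an empty feature matrix but a
    non-zero node count -- the state that triggers the index error.

    Sketch only: each item is assumed to expose `x_rows` (number of
    rows in graph.x) and `num_nodes`; adapt to your data objects.
    """
    bad = []
    for idx, g in enumerate(dataset):
        if g["x_rows"] == 0 and g["num_nodes"] != 0:
            bad.append(idx)
    return bad
```

Any indices this reports point at graphs whose features were lost (e.g. to an OOM during preprocessing) and should be reprocessed or excluded.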

yucheny5 commented 4 months ago

Hi, Xiaoxin,

Thanks for the reply. I have solved this issue by using two 80 GB A100 GPUs and modifying the batch size. However, when I run the following command: python train.py --dataset webqsp --model_name graph_llm

The following error occurs:

Loading LLAMA
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00, 1.45s/it]
Freezing LLAMA!
Finish loading LLAMA!
trainable params: 31485952 || all params: 6769901568 || trainable%: 0.4650872938659542
10%|████ | 353/3530 [05:44<43:38, 1.21it/s]Epoch: 0|10: Train Loss (Epoch Mean): 1.3945694382568932

at line 146, in forward:

inputs_embeds = torch.cat([bos_embeds, graph_embeds[i].unsqueeze(0), inputs_embeds], dim=0)
IndexError: index 7 is out of bounds for dimension 0 with size 7

In addition, it seems that python train.py --dataset scene_graphs --model_name graph_llm with a frozen LLM takes over 1.5 days to run. Is this running time normal on two A100 80 GB GPUs?

XiaoxinHe commented 4 months ago

For the error in the WebQSP dataset, could you please manually remove '2937' from the file dataset/webqsp/split/val_indices.txt? This is because dataset[2937] contains an empty graph. We will fix this issue in the code later.
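The manual edit suggested above can also be done with a small script. A sketch, assuming the split file has one index per line; the path comes from the message and the function name is made up:

```python
from pathlib import Path


def drop_index(path, bad_index="2937"):
    """Remove a single index (one line) from a split file in place.

    Sketch only: rewrites the file keeping every line except the one
    matching `bad_index`.
    """
    p = Path(path)
    lines = [ln for ln in p.read_text().splitlines() if ln.strip() != bad_index]
    p.write_text("\n".join(lines) + "\n")
```

Usage would be drop_index("dataset/webqsp/split/val_indices.txt"), run once before training.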

Regarding the training on the Scene Graphs dataset, yes, it does take a long time to train due to its large scale (100k data points in total). However, we have introduced early stopping, which will halt training when the validation loss stops decreasing after two epochs. In my case, it stops at the 5th or 6th epoch and takes approximately 12 hours to finish with 2 A100 80G GPUs.
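The patience-based stopping described here (halt after the validation loss stops improving for two epochs) follows a standard pattern. A generic illustration, not the repository's actual implementation; the class and method names are hypothetical:

```python
class EarlyStopping:
    """Stop training once validation loss fails to improve for
    `patience` consecutive epochs (2 in the setup described above)."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would call step() once per epoch and break when it returns True, which is why Scene Graphs runs typically end around epoch 5 or 6 rather than using the full epoch budget.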

yucheny5 commented 4 months ago

Thanks for the reply! I have removed '2937' from the file and it runs smoothly.