Hi,
In response to #10, I tried to reproduce the error by running the preprocessing from scratch with `python -m src.dataset.preprocess.webqsp` and `python -m src.dataset.webqsp`. However, I did not encounter the same issue.
This issue is triggered by a graph where `graph.x.size(0) == 0` but `graph.num_nodes != 0`, and was likely caused by an OOM error during preprocessing. Could you please check whether your program enters the except branch at this line during preprocessing? If so, please decrease the batch size until it fits into your memory and rerun the preprocessing. Thank you!
Hi, Xiaoxin,
Thanks for the reply. I have solved this issue by using two 80G A100 GPUs and modifying the batch size. However, when I run the following command: `python train.py --dataset webqsp --model_name graph_llm`
the following error occurs:

Loading LLAMA
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:02<00:00, 1.45s/it]
Freezing LLAMA!
Finish loading LLAMA!
trainable params: 31485952 || all params: 6769901568 || trainable%: 0.4650872938659542
10%|████ | 353/3530 [05:44<43:38, 1.21it/s]
Epoch: 0|10: Train Loss (Epoch Mean): 1.3945694382568932

The error is raised at line 146, in forward:
inputs_embeds = torch.cat([bos_embeds, graph_embeds[i].unsqueeze(0), inputs_embeds], dim=0)
IndexError: index 7 is out of bounds for dimension 0 with size 7
In addition, it seems that `python train.py --dataset scene_graphs --model_name graph_llm` with a frozen LLM takes over 1.5 days to run. Is this running time normal on two A100 80G GPUs?
For the error in the WebQSP dataset, could you please manually remove '2937' from the file `dataset/webqsp/split/val_indices.txt`? This is because `dataset[2937]` contains an empty graph. We will fix this issue in the code later.
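If it helps, a small one-off script along these lines can apply the workaround; it assumes the split file lists one index per line:

```python
# One-off workaround sketch: drop index 2937 from the validation split.
# Assumes dataset/webqsp/split/val_indices.txt contains one index per line.
path = 'dataset/webqsp/split/val_indices.txt'

with open(path) as f:
    indices = [line.strip() for line in f if line.strip()]

indices = [idx for idx in indices if idx != '2937']

with open(path, 'w') as f:
    f.write('\n'.join(indices) + '\n')
```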
Regarding the training on the Scene Graphs dataset, yes, it does take a long time to train due to its large scale (100k data points in total). However, we have introduced early stopping, which will halt training when the validation loss stops decreasing after two epochs. In my case, it stops at the 5th or 6th epoch and takes approximately 12 hours to finish with 2 A100 80G GPUs.
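For context, the early stopping works along these lines (a simplified sketch, not the exact training loop; `num_epochs`, `train_one_epoch`, and `evaluate` are placeholders):

```python
# Simplified sketch of early stopping on validation loss with a patience of 2 epochs.
patience = 2
best_val_loss = float('inf')
epochs_without_improvement = 0

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader)      # placeholder for one training epoch
    val_loss = evaluate(model, val_loader)    # placeholder for validation

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f'Early stopping at epoch {epoch}')
            break
```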
Thanks for the reply! I have removed '2937' from the file and it runs smoothly.
Dear authors,
I am encountering exactly the same issue as "RuntimeError: selected index k out of range #10". Could the authors please double-check this error by running the preprocessing code from scratch instead of using the processed dataset?
Thanks~