Closed: mlrus closed this issue 2 weeks ago.
The `cached_graphs` will be generated automatically by `python -m src.dataset.expla_graphs`. Our data-processing logic is: `python -m src.dataset.preprocess.webqsp` generates graphs from the triples in the original data; `python -m src.dataset.expla_graphs` performs retrieval on those graphs and stores each retrieved subgraph as a cache for later generation. If the graphs had to be retrieved again every time the model is trained, parameter tuning would become far less efficient. You can store subgraphs retrieved under different settings in different folders; during training, you only need to change the data path to the corresponding folder.

Hope this helps!
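The retrieve-once caching pattern described above can be sketched as follows. All function and file names here are illustrative, not the GRAG API; GRAG stores torch `.pt` files, but JSON is used to keep the sketch dependency-free:

```python
# Sketch of the retrieve-once / read-from-cache pattern (hypothetical names).
import json
import os

def retrieve_subgraph(graph, query):
    """Stand-in for the expensive retrieval step."""
    return {"nodes": [n for n in graph if query in n], "query": query}

def get_subgraph(index, graph, query, cache_dir="cached_graphs"):
    """Return the cached subgraph if present; otherwise retrieve and cache it."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{index}.json")
    if os.path.exists(path):                  # cache hit: skip retrieval
        with open(path) as f:
            return json.load(f)
    sub = retrieve_subgraph(graph, query)     # cache miss: retrieve once
    with open(path, "w") as f:
        json.dump(sub, f)
    return sub
```

Keeping one `cache_dir` per retrieval setting is what lets you switch settings by changing only the data path, as described above.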
How can I resolve these errors? Is it a versioning issue or a missing parameter? The `dataset/webqsp/cached_desc/` files do not exist.
Steps:
1. Generated `dataset/webqsp/graphs/*.pt` with `python -m src.dataset.preprocess.webqsp`
2. `mkdir dataset/webqsp/cached_graphs`
3. `cd dataset/webqsp; ln graphs/* cached_graphs`
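The linking workaround in the steps above can be reproduced standalone. The sketch below uses dummy files in place of the real `.pt` graphs produced by preprocessing:

```shell
# Standalone demo of the hard-link workaround (dummy files, not real graphs).
mkdir -p dataset/webqsp/graphs dataset/webqsp/cached_graphs
touch dataset/webqsp/graphs/0.pt dataset/webqsp/graphs/1.pt
ln dataset/webqsp/graphs/* dataset/webqsp/cached_graphs   # hard links, no extra copies
```

Note that this only populates `cached_graphs`; nothing in these steps creates `cached_desc`, which is why the read below fails.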
At this point there is a new error: the code looks for `dataset/webqsp/cached_desc/0.txt`, but no `#.txt` files exist at all. Here is the runtime output:
```
$ python train.py --dataset webqsp --model_name graph_llm --seed 3
inherit model weights from sentence-transformers/all-roberta-large-v1
/data2/_user_name_/anaconda3/envs/grag/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
Namespace(model_name='graph_llm', project='projection', seed=3, dataset='webqsp', lr=1e-05, wd=0.05, patience=5, batch_size=2, grad_steps=2, num_epochs=10, warmup_epochs=1, eval_batch_size=16, llm_model_name='7b', llm_model_path='', llm_frozen='True', llm_num_virtual_tokens=10, output_dir='output', max_txt_len=512, max_new_tokens=32, gnn_model_name='gat', gnn_num_layers=4, gnn_in_dim=1024, gnn_hidden_dim=1024, alignment_mlp_layers=3, gnn_num_heads=4, distance_operator='euclidean', gnn_dropout=0.0)
Traceback (most recent call last):
  File "/data2/_user_name_/gits/GRAG/train.py", line 139, in <module>
    main(args)
  File "/data2/_user_name_/gits/GRAG/train.py", line 34, in main
    train_dataset = [dataset[i] for i in idx_split['train']]
  File "/data2/_user_name_/gits/GRAG/train.py", line 34, in <listcomp>
    train_dataset = [dataset[i] for i in idx_split['train']]
  File "/data2/_user_name_/gits/GRAG/src/dataset/webqsp.py", line 41, in __getitem__
    desc = open(f'{cached_desc}/{index}.txt', 'r').read()
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/webqsp/cached_desc/0.txt'
```