HuieL / GRAG

MIT License

dataset/webqsp/cached_desc/0.txt not found #4

Closed mlrus closed 2 weeks ago

mlrus commented 2 weeks ago

How can I resolve these errors? Is it a versioning issue or a missing parameter? The dataset/webqsp/cached_desc/ files do not exist.

Steps:

  1. Generated the files dataset/webqsp/graphs/*.pt with python -m src.dataset.preprocess.webqsp
  2. Created the missing directory: mkdir dataset/webqsp/cached_graphs
  3. Hard-linked graphs/ into cached_graphs/: cd dataset/webqsp; ln graphs/* cached_graphs

At this point there is a new error: dataset/webqsp/cached_desc/0.txt is not found, and there are no numbered .txt files in that directory at all.

Here is the runtime output:

$ python train.py --dataset webqsp --model_name graph_llm --seed 3
inherit model weights from sentence-transformers/all-roberta-large-v1
/data2/_user_name_/anaconda3/envs/grag/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
Namespace(model_name='graph_llm', project='projection', seed=3, dataset='webqsp', lr=1e-05, wd=0.05, patience=5, batch_size=2, grad_steps=2, num_epochs=10, warmup_epochs=1, eval_batch_size=16, llm_model_name='7b', llm_model_path='', llm_frozen='True', llm_num_virtual_tokens=10, output_dir='output', max_txt_len=512, max_new_tokens=32, gnn_model_name='gat', gnn_num_layers=4, gnn_in_dim=1024, gnn_hidden_dim=1024, alignment_mlp_layers=3, gnn_num_heads=4, distance_operator='euclidean', gnn_dropout=0.0)
Traceback (most recent call last):
  File "/data2/_user_name_/gits/GRAG/train.py", line 139, in <module>
    main(args)
  File "/data2/_user_name_/gits/GRAG/train.py", line 34, in main
    train_dataset = [dataset[i] for i in idx_split['train']]
  File "/data2/_user_name_/gits/GRAG/train.py", line 34, in <listcomp>
    train_dataset = [dataset[i] for i in idx_split['train']]
  File "/data2/_user_name_/gits/GRAG/src/dataset/webqsp.py", line 41, in __getitem__
    desc = open(f'{cached_desc}/{index}.txt', 'r').read()
FileNotFoundError: [Errno 2] No such file or directory: 'dataset/webqsp/cached_desc/0.txt'

HuieL commented 2 weeks ago

The cached_graphs folder will be generated automatically by running python -m src.dataset.expla_graphs. Our data-processing logic is:

  1. python -m src.dataset.preprocess.webqsp will generate graphs using triples in the original data;
  2. python -m src.dataset.expla_graphs performs retrieval on those graphs and stores the retrieved subgraphs as caches for the later generation step.

If retrieval were re-run every time the model is trained, hyperparameter tuning would become much slower. You can store subgraphs retrieved under different settings in different folders; during training, you only need to point the data path at the corresponding folder.

Hope this is helpful!