Documentation for Dataset Preparation

samveldfwork commented 3 weeks ago

Hello,

Thank you for your great work. I managed to run the code with the provided dataset. However, after exploring the repository and the paper, I noticed that dataset preparation is not clearly documented. There are several scripts related to data processing, but there is no comprehensive guide on how to prepare our own data for training and evaluation.

Specifically, it would be much appreciated if you could provide detailed instructions on:

The format and structure of the input data.
Steps to preprocess and prepare the dataset.
Any specific requirements or dependencies needed for dataset preparation.
Example commands or scripts for dataset preparation, especially for building and embedding the subgraphs.

Thank you for your attention to this matter.

RidhiChhajer commented 2 weeks ago

Is the subgraph retrieved every time a query is asked?

cmavro commented 2 weeks ago

Thanks for your suggestions!

Here are the basic steps:

GNN:

Entity Linking: The question entities are linked to the KG.
Subgraph Extraction: The KG subgraphs (e.g., 4-hop neighbors) are extracted based on the linked entities.
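The subgraph extraction step can be illustrated with a minimal sketch. It assumes the KG is available as an in-memory list of (head, relation, tail) triples and expands the seed entities with a plain breadth-first search that ignores edge direction. The function name `extract_khop_subgraph` and the toy triples are hypothetical, and this is not the NSM preprocessing itself, which may prune the neighborhood further.

```python
from collections import defaultdict


def extract_khop_subgraph(triples, seed_entities, num_hops=2):
    """Return every triple reachable within num_hops of the seed entities."""
    # Index triples by both endpoints so the expansion ignores edge direction.
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((head, relation, tail))
        adjacency[tail].append((head, relation, tail))

    visited = set(seed_entities)
    frontier = set(seed_entities)
    subgraph = set()
    for _ in range(num_hops):
        next_frontier = set()
        for entity in frontier:
            for head, relation, tail in adjacency[entity]:
                subgraph.add((head, relation, tail))
                for neighbor in (head, tail):
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_frontier.add(neighbor)
        frontier = next_frontier
    return sorted(subgraph)


# Toy KG (made-up triples, for illustration only).
toy_kg = [
    ("Jamaica", "language_spoken", "Jamaican English"),
    ("Jamaica", "contained_by", "Caribbean"),
    ("Caribbean", "contains", "Cuba"),
    ("Cuba", "language_spoken", "Spanish"),
]

one_hop = extract_khop_subgraph(toy_kg, ["Jamaica"], num_hops=1)
two_hop = extract_khop_subgraph(toy_kg, ["Jamaica"], num_hops=2)
```

With this toy KG, the 1-hop subgraph holds the two triples that touch Jamaica, and the 2-hop subgraph adds the Caribbean-Cuba triple. For a KG the size of Freebase, an in-memory list is unrealistic; the sketch only shows the shape of the computation.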

For WebQSP and CWQ, we follow the NSM algorithm. You can find their preprocessing steps here. After these steps are run, the input data file (JSON format, e.g., test.json) contains the following fields: question, seed_entities (obtained via entity linking), subgraph tuples (obtained via subgraph extraction; these are in the format (head id, relation id, tail id) or (head name, relation, tail name)), and answer.
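As a rough illustration of the fields listed above, here is a sketch of one record and a small validator. The exact key names (`subgraph_tuples` in particular), the one-JSON-object-per-line layout, and the helper names are assumptions for illustration; check the keys in the provided data files before relying on them.

```python
import json

# Assumed key names, taken loosely from the field list above.
REQUIRED_FIELDS = ("question", "seed_entities", "subgraph_tuples", "answer")


def validate_record(record):
    """Check that a record has every required field and well-formed triples."""
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    for triple in record["subgraph_tuples"]:
        if len(triple) != 3:
            raise ValueError(f"expected (head, relation, tail), got {triple!r}")
    return record


def load_records(lines):
    """Parse one JSON object per non-empty line (assumed file layout)."""
    return [validate_record(json.loads(line)) for line in lines if line.strip()]


# A made-up record using entity and relation names rather than ids.
example_record = {
    "question": "what language is spoken in jamaica",
    "seed_entities": ["Jamaica"],
    "subgraph_tuples": [
        ["Jamaica", "language_spoken", "Jamaican English"],
        ["Jamaica", "contained_by", "Caribbean"],
    ],
    "answer": ["Jamaican English"],
}

records = load_records([json.dumps(example_record), ""])
```

To read a real file, pass an open file handle instead of the list, for example `load_records(open("test.json", encoding="utf-8"))`.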

Running the above steps for the train and test questions produces the data files needed to train your GNN. We train the ReaRev GNN described here, but you can use a different one.

RAG:

Please run inference with the GNN as described here. This generates the GNN's candidate answers in the right format.

For RAG, we follow RoG (their GitHub code looks deactivated at the moment) as described in the GNN-RAG/llm folder. The shortest paths obtained by the GNN are verbalized and concatenated into the LLM input to produce the predictions.
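The path extraction and verbalization idea can be sketched as follows. This is a simplified illustration, not the code in the GNN-RAG/llm folder: it assumes the subgraph is a list of directed (head, relation, tail) triples, finds one shortest path per candidate with breadth-first search, and joins it with " -> ". The function names and the separator are assumptions.

```python
from collections import deque


def shortest_path(triples, source, target):
    """Breadth-first search over directed triples; returns a triple list or None."""
    outgoing = {}
    for head, relation, tail in triples:
        outgoing.setdefault(head, []).append((relation, tail))

    # parents maps each reached node to (previous node, relation used).
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for relation, neighbor in outgoing.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = (node, relation)
                queue.append(neighbor)
    if target not in parents:
        return None

    # Walk back from the target to the source, then reverse.
    path = []
    node = target
    while parents[node] is not None:
        previous, relation = parents[node]
        path.append((previous, relation, node))
        node = previous
    return path[::-1]


def verbalize(path):
    """Turn a triple path into 'entity -> relation -> entity' text."""
    if not path:
        return ""
    parts = [path[0][0]]
    for _, relation, tail in path:
        parts.extend([relation, tail])
    return " -> ".join(parts)


# Made-up subgraph, question entity, and GNN candidate answer.
subgraph = [
    ("Bob Marley", "born_in", "Jamaica"),
    ("Jamaica", "language_spoken", "Jamaican English"),
    ("Jamaica", "contained_by", "Caribbean"),
]
path = shortest_path(subgraph, "Bob Marley", "Jamaican English")
text = verbalize(path)
```

Here `text` is "Bob Marley -> born_in -> Jamaica -> language_spoken -> Jamaican English". Strings like this, one per candidate answer, would then be placed ahead of the question in the LLM prompt; the prompt template itself is not shown because it depends on the model being used.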

Overall:

KG subgraphs are needed for each question for GNN training/evaluation. Then, the shortest paths between question entities and answer candidates are extracted for RAG. In our work, we follow previous work (GraftNet, NSM) and their preprocessing steps to get the KG subgraphs for WebQSP and CWQ from the Freebase KG. If you want to test on your own data, you should follow similar preprocessing steps.

I will try to upload code samples in the coming weeks. Thanks for your patience!

RidhiChhajer commented 2 weeks ago

Thank you for your reply. Looking forward to your uploads next week :)

RidhiChhajer commented 2 weeks ago

@cmavro How do I use the MetaQA data? Do I need to change the pipeline or the data format?
