TRAIS-Lab / LLM-Structured-Data


Can LLMs Effectively Leverage Graph Structural Information: When and Why


We provide three main components:

1. New Dataset: arxiv-2023

arxiv-2023 was collected as a counterpart to ogbn-arxiv so that the two can be compared directly. Both datasets are directed citation networks in which each node is a paper published on arXiv and each edge indicates that one paper cites another.

Statistics of ogbn-arxiv and arxiv-2023 datasets

| Dataset | #Nodes (Full Dataset) | #Edges (Full Dataset) | In-Degree/Out-Degree (Test Set) | Average Degree (Test Set) | Published Year (Test Set) |
|---|---|---|---|---|---|
| ogbn-arxiv | 169343 | 1166243 | 1.33/11.1 | 12.43 | 2019 |
| arxiv-2023 | 33868 | 305672 | 0.16/10.6 | 10.76 | 2023 |
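
The degree columns in the statistics table can be cross-checked with a few lines of Python. This is a minimal sketch that assumes the reported average degree is the sum of the mean in-degree and the mean out-degree of the test nodes. That reading is an inference from the numbers, not something the repo states, and `degree_sum` is a hypothetical helper name.

```python
# Values copied from the statistics table:
# (mean in-degree, mean out-degree, reported average degree) on the test set.
stats = {
    "ogbn-arxiv": (1.33, 11.1, 12.43),
    "arxiv-2023": (0.16, 10.6, 10.76),
}


def degree_sum(in_deg, out_deg):
    # Assumption: average degree = mean in-degree + mean out-degree.
    # Rounded to two decimals to match the precision used in the table.
    return round(in_deg + out_deg, 2)


for name, (in_deg, out_deg, reported) in stats.items():
    total = degree_sum(in_deg, out_deg)
    print(f"{name}: {in_deg} + {out_deg} = {total} (reported {reported})")
```

Under that assumption, both rows of the table are internally consistent.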

Proportional distribution of labels in ogbn-arxiv and arxiv-2023 datasets. Each label represents an arXiv Computer Science Category.

2. Unified Dataloader for Datasets and Raw Text

Download Datasets and Raw Text

The dataset and raw text for arxiv-2023 are included in this repo. For the other datasets, you may need to download the dataset and raw text yourself.

Set up environment and OpenAI API key

Set your OpenAI API key in the OPENAI_API_KEY environment variable. See here for details.
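
One way to confirm the key is visible to Python before running anything is sketched below. `get_openai_api_key` is a hypothetical helper, not part of this repo; it only reads the OPENAI_API_KEY variable named above and fails early with a readable message if the variable is missing.

```python
import os


def get_openai_api_key(env=os.environ):
    """Return the key stored in OPENAI_API_KEY, or raise if it is missing or empty.

    Hypothetical helper for illustration; `env` is injectable so the check
    can be exercised without touching the real environment.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in your shell before running."
        )
    return key
```

Calling `get_openai_api_key()` at the top of a script surfaces a missing key immediately, before any API request is made.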

Required packages include openai, PyTorch, PyG (PyTorch Geometric), and ogb, among others.

Data Loading API

>>> from utils.utils import load_data
>>> data, text = load_data("arxiv_2023", use_text=True)
>>> print(data)
Data(x=[33868, 128], edge_index=[2, 305672], y=[33868, 1], paper_id=[33868], train_mask=[33868], val_mask=[33868], test_mask=[33868], num_nodes=33868, train_id=[19461], val_id=[4682], test_id=[668])
>>> print(text.keys())
dict_keys(['title', 'abs', 'label', 'id'])
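
The raw text returned alongside the graph can be turned into a per-paper prompt. The sketch below assumes that each entry of the `text` dict is a list aligned with the node index, so that `text["title"][i]` and `text["abs"][i]` describe node `i`. That layout is an assumption, as are the `build_prompt` helper, its prompt wording, and the made-up toy values. It runs on a small stand-in dict so it does not depend on the repo being installed.

```python
def build_prompt(text, node_idx, max_abs_chars=500):
    """Format one paper's title and abstract as a plain-text prompt.

    Hypothetical template for illustration; not the prompt used in the paper.
    Assumes text["title"] and text["abs"] are lists indexed by node.
    """
    title = text["title"][node_idx]
    abstract = text["abs"][node_idx][:max_abs_chars]  # truncate long abstracts
    return (
        f"Title: {title}\n"
        f"Abstract: {abstract}\n"
        "Which arXiv CS category does this paper belong to?"
    )


# Toy stand-in for the `text` dict returned by load_data:
# same keys as shown above, made-up values.
toy_text = {
    "title": ["Paper A", "Paper B"],
    "abs": ["Abstract of A.", "Abstract of B."],
    "label": ["cs.LG", "cs.CL"],
    "id": ["0001", "0002"],
}

print(build_prompt(toy_text, 1))
```

With the real data, `node_idx` would typically come from `data.test_id`, provided the alignment assumption above holds.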


If you find this repo helpful for your research, please consider citing our paper below.

      title={Can LLMs Effectively Leverage Graph Structural Information: When and Why}, 
      author={Jin Huang and Xingjian Zhang and Qiaozhu Mei and Jiaqi Ma},