HKUDS / GraphGPT

[SIGIR'2024] "GraphGPT: Graph Instruction Tuning for Large Language Models"
https://arxiv.org/abs/2310.13023
Apache License 2.0

GraphGPT: Graph Instruction Tuning for Large Language Models

Jiabin Tang, Yuhao Yang, Wei Wei, Lei Shi, Suqi Cheng, Dawei Yin and Chao Huang* (*Correspondence)

Data Intelligence Lab@University of Hong Kong, Baidu Inc.

This repository hosts the code, data and model weights of **GraphGPT** (SIGIR'24 full paper track).

🎉 News

0. Environment Update:

Lightweight training requires PyTorch 2.1+, so the corresponding libraries need to be updated:

# if you have set up the env for GraphGPT earlier
pip uninstall torch
pip uninstall torchvision
pip uninstall torchaudio
# CUDA 11.8
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# update pyg for the PyTorch 2.1+
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.1.0+cu118.html

# install lightning
pip install lightning
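
After updating, it can help to confirm that the installed PyTorch actually satisfies the 2.1+ requirement. The following is a minimal sketch and not part of the repository: `meets_minimum` and `check_installed_torch` are hypothetical helpers that compare a version string such as `torch.__version__` against a minimum, ignoring local build suffixes like `+cu118`.

```python
import re


def meets_minimum(version: str, minimum=(2, 1)) -> bool:
    """Return True if `version` (e.g. "2.1.0+cu118") is at least `minimum`."""
    # Drop a local build tag such as "+cu118" before parsing.
    public = version.split("+", 1)[0]
    # Keep only as many leading numeric components as `minimum` has.
    parts = tuple(int(p) for p in re.findall(r"\d+", public)[: len(minimum)])
    # Pad short versions such as "2" so the tuple comparison is well defined.
    parts = parts + (0,) * (len(minimum) - len(parts))
    return parts >= tuple(minimum)


def check_installed_torch() -> bool:
    """Check the installed PyTorch; assumes `torch` is importable."""
    import torch  # imported lazily so meets_minimum works without torch

    return meets_minimum(torch.__version__)
```

For example, `meets_minimum("2.1.0+cu118")` is true, while the older `1.13.0+cu117` build from the original environment setup is rejected.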

1. Update the Graph Data

Due to compatibility issues, if you are still using the previously released graph data, we recommend replacing it with the version at the provided link: updated graph data.

2. Run the Scripts

You can run the scripts as follows:

Stage-1:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage1.sh

Stage-2:

cd path/to/GraphGPT
sh ./scripts/tune_script/graphgpt_stage2.sh

FAQ

- If `pretrain_graph_model_path` is reported as not defined, please refer to issue [#7](https://github.com/HKUDS/GraphGPT/issues/7).
- If flash attention does not work in your environment, comment out the call to `replace_llama_attn_with_flash_attn()` on line 8 of https://github.com/HKUDS/GraphGPT/blob/main/graphgpt/train/train_mem.py. For more details, please refer to issue [#17](https://github.com/HKUDS/GraphGPT/issues/17).
- If you run into package conflicts or environment setup errors (especially with fastchat), please refer to issue [#9](https://github.com/HKUDS/GraphGPT/issues/9) and issue [#11](https://github.com/HKUDS/GraphGPT/issues/11).
- If you see a `No module named 'graphgpt'` error, please refer to issue [#56](https://github.com/HKUDS/GraphGPT/issues/56).

🎯🎯📢📢 We have made significant updates to the models and data used in our GraphGPT on 🤗 Huggingface. We highly recommend referring to the table below for further details:

| 🤗 Huggingface Address | 🎯 Description |
| --- | --- |
| huggingface.co/Jiabin99/GraphGPT-7B-mix-all | Checkpoint of GraphGPT based on Vicuna-7B-v1.5, tuned on the instruction data Arxiv-PubMed-mix-NC-LP. |
| huggingface.co/Jiabin99/Arxiv-PubMed-GraphCLIP-GT | Checkpoint of the pre-trained graph transformer (GT), trained on Arxiv and PubMed using text-graph grounding. |
| huggingface.co/datasets/Jiabin99/Arxiv-PubMed-mix-NC-LP | Mixed instruction dataset with node classification (NC) and link prediction (LP) on Arxiv and PubMed. |
| huggingface.co/datasets/Jiabin99/GraphGPT-eval-instruction | All instruction datasets used for our evaluation. |
| huggingface.co/datasets/Jiabin99/All_pyg_graph_data | All utilized graph data, merged. |
| huggingface.co/datasets/Jiabin99/graph-matching | Instruction data used in the graph-matching stage. |
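
The entries above can also be fetched programmatically. The sketch below is an illustration and not part of the repository: `parse_hub_address` and `download` are hypothetical helpers, and `download` assumes the `huggingface_hub` package is installed and relies on its `snapshot_download` function.

```python
from typing import Tuple


def parse_hub_address(address: str) -> Tuple[str, str]:
    """Split an address from the table into (repo_id, repo_type)."""
    # Accept both "huggingface.co/..." and "https://huggingface.co/...".
    path = address.split("huggingface.co/", 1)[-1].strip("/")
    # Dataset repositories live under the "datasets/" prefix.
    if path.startswith("datasets/"):
        return path[len("datasets/"):], "dataset"
    return path, "model"


def download(address: str, local_dir: str) -> str:
    """Download one table entry; requires `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # lazy import

    repo_id, repo_type = parse_hub_address(address)
    # Returns the local folder that now holds the repository files.
    return snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=local_dir)
```

A call such as `download("huggingface.co/datasets/Jiabin99/graph-matching", "./data/stage_1")` would place the files in the given directory; the target path here is only an example.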

👉 TODO


Brief Introduction

We present the GraphGPT framework, which aligns LLMs with graph structural knowledge through a graph instruction tuning paradigm.

For more technical details, kindly refer to the paper and the project website of GraphGPT.


Getting Started


1. Code Structure

.
├── README.md
├── assets
│   ├── demo_narrow.gif
│   ├── screenshot_cli.png
│   ├── screenshot_gui.png
│   ├── server_arch.png
│   └── vicuna_logo.jpeg
├── format.sh
├── graphgpt
│   ├── __init__.py
│   ├── constants.py
│   ├── conversation.py
│   ├── eval
│   │   ├── README.md
│   │   ├── requirements.txt
│   │   ├── run_graphgpt.py
│   │   ├── run_graphgpt_LP.py
│   │   ├── run_vicuna.py
│   │   └── script
│   │       └── run_model_qa.yaml
│   ├── model
│   │   ├── GraphLlama.py
│   │   ├── __init__.py
│   │   ├── apply_delta.py
│   │   ├── apply_lora.py
│   │   ├── builder.py
│   │   ├── compression.py
│   │   ├── convert_fp16.py
│   │   ├── graph_layers
│   │   │   ├── __init__.py
│   │   │   ├── bpe_simple_vocab_16e6.txt.gz
│   │   │   ├── clip_graph.py
│   │   │   ├── graph_transformer.py
│   │   │   ├── mpnn.py
│   │   │   └── simple_tokenizer.py
│   │   ├── make_delta.py
│   │   ├── model_adapter.py
│   │   ├── model_registry.py
│   │   ├── monkey_patch_non_inplace.py
│   │   └── utils.py
│   ├── protocol
│   │   └── openai_api_protocol.py
│   ├── serve
│   │   ├── __init__.py
│   │   ├── api_provider.py
│   │   ├── bard_worker.py
│   │   ├── cacheflow_worker.py
│   │   ├── cli.py
│   │   ├── controller.py
│   │   ├── gateway
│   │   │   ├── README.md
│   │   │   └── nginx.conf
│   │   ├── gradio_block_arena_anony.py
│   │   ├── gradio_block_arena_named.py
│   │   ├── gradio_css.py
│   │   ├── gradio_patch.py
│   │   ├── gradio_web_server.py
│   │   ├── gradio_web_server_multi.py
│   │   ├── huggingface_api.py
│   │   ├── inference.py
│   │   ├── model_worker.py
│   │   ├── monitor
│   │   │   ├── basic_stats.py
│   │   │   ├── clean_battle_data.py
│   │   │   ├── elo_analysis.py
│   │   │   ├── hf_space_leaderboard_app.py
│   │   │   └── monitor.py
│   │   ├── openai_api_server.py
│   │   ├── register_worker.py
│   │   ├── test_message.py
│   │   └── test_throughput.py
│   ├── train
│   │   ├── graphchat_trainer.py
│   │   ├── llama_flash_attn_monkey_patch.py
│   │   ├── train_graph.py
│   │   ├── train_lora.py
│   │   └── train_mem.py
│   └── utils.py
├── playground
│   ├── inspect_conv.py
│   ├── test_embedding
│   │   ├── README.md
│   │   ├── test_classification.py
│   │   ├── test_semantic_search.py
│   │   └── test_sentence_similarity.py
│   └── test_openai_api
│       ├── anthropic_api.py
│       └── openai_api.py
├── pyproject.toml
├── scripts
│   ├── eval_script
│   │   └── graphgpt_eval.sh
│   ├── extract_graph_projector.py
│   ├── serving
│   │   ├── controller.yaml
│   │   └── model_worker.yaml
│   └── tune_script
│       ├── extract_projector.sh
│       ├── graphgpt_stage1.sh
│       └── graphgpt_stage2.sh
└── tests
    ├── test_openai_curl.sh
    ├── test_openai_langchain.py
    └── test_openai_sdk.py

2. Environment Preparation

Please first clone the repo and install the required environment, which can be done by running the following commands:

conda create -n graphgpt python=3.8

conda activate graphgpt

# Torch with CUDA 11.7
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
# To support vicuna base model
pip3 install "fschat[model_worker,webui]"
# To install pyg and pyg-relevant packages
pip install torch_geometric
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-1.13.0+cu117.html
# Clone our GraphGPT
git clone https://github.com/HKUDS/GraphGPT.git
cd GraphGPT
# Install required libraries
pip install -r requirements.txt

3. Training GraphGPT

The GraphGPT tuning paradigm consists of two stages: (1) self-supervised instruction tuning; (2) task-specific instruction tuning.

3.1. Preparing Pre-trained Checkpoint

GraphGPT is trained on top of the following excellent existing models. Please follow the instructions to prepare the checkpoints.

3.2. Self-Supervised Instruction Tuning

# Fill in the following paths to run the first stage of GraphGPT.
model_path=../vicuna-7b-v1.5-16k
instruct_ds=./data/stage_1/graph_matching.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
output_model=./checkpoints/stage_1

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --report_to wandb
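
For reference, the global batch size implied by the command above follows from three of its flags. This is a small worked example, assuming the usual data-parallel semantics of `torch.distributed.run` with the Hugging Face `Trainer` (one process per GPU, gradients accumulated before each optimizer step); the function names are illustrative and not part of the repository.

```python
def effective_batch_size(nproc_per_node: int, per_device_batch: int,
                         grad_accum: int = 1, nnodes: int = 1) -> int:
    """Samples contributing to one optimizer step across all processes."""
    return nnodes * nproc_per_node * per_device_batch * grad_accum


def grad_accum_for(target: int, nproc_per_node: int, per_device_batch: int) -> int:
    """Accumulation steps needed to reach `target` samples per optimizer step."""
    per_step = nproc_per_node * per_device_batch
    if target % per_step != 0:
        raise ValueError("target is not a multiple of nproc_per_node * per_device_batch")
    return target // per_step


# Stage-1 flags from the command above: 4 processes, batch 2, no accumulation.
print(effective_batch_size(nproc_per_node=4, per_device_batch=2))  # 8
# Keeping the same global batch on 2 GPUs would need 2 accumulation steps.
print(grad_accum_for(target=8, nproc_per_node=2, per_device_batch=2))  # 2
```

Matching the global batch size this way only preserves the number of samples per optimizer step; memory use and throughput will still differ with the GPU count.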

3.3. Extract the Trained Projector

You can extract the projector trained in stage 1 by filling in the blanks in extract_projector.sh. An example is shown below:

# Fill in the following paths to extract the projector from the first tuning stage.
src_model=./checkpoints/stage_1
output_proj=./checkpoints/stage_1_projector/stage_1_projector.bin

python3.8 ./scripts/extract_graph_projector.py \
  --model_name_or_path ${src_model} \
  --output ${output_proj}
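
Conceptually, extracting the projector means keeping only the projector's tensors from the stage-1 checkpoint and saving them to their own file. The sketch below illustrates the idea on a plain dictionary. It is not the code of `extract_graph_projector.py`; the keyword `graph_projector` and the example key names are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable


def select_weights(state_dict: Dict[str, Any], keywords: Iterable[str]) -> Dict[str, Any]:
    """Keep only the entries whose name contains one of `keywords`."""
    keywords = tuple(keywords)
    return {
        name: value
        for name, value in state_dict.items()
        if any(keyword in name for keyword in keywords)
    }


# Toy checkpoint with made-up key names; a real checkpoint holds tensors.
toy_checkpoint = {
    "model.layers.0.self_attn.q_proj.weight": "llm-weight",
    "model.graph_projector.weight": "projector-weight",
    "model.graph_projector.bias": "projector-bias",
}

projector = select_weights(toy_checkpoint, ["graph_projector"])
print(sorted(projector))
# ['model.graph_projector.bias', 'model.graph_projector.weight']
```

With real checkpoints, the selected dictionary would then be written to the path given by `output_proj`, which stage 2 reads through `--pretrain_graph_mlp_adapter`.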

3.4. Task-Specific Instruction Tuning

# Fill in the following paths to run the second stage of GraphGPT.
model_path=../vicuna-7b-v1.5-16k
instruct_ds=./data/stage_2/data_all_mix.json
graph_data_path=./graph_data/all_graph_data.pt
pretra_gnn=clip_gt_arxiv
tuned_proj=./checkpoints/stage_1_projector/stage_1_projector.bin
output_model=./checkpoints/stage_2

wandb offline
python -m torch.distributed.run --nnodes=1 --nproc_per_node=4 --master_port=20001 \
    graphgpt/train/train_mem.py \
    --model_name_or_path ${model_path} \
    --version v1 \
    --data_path ${instruct_ds} \
    --graph_content ./arxiv_ti_ab.json \
    --graph_data_path ${graph_data_path} \
    --graph_tower ${pretra_gnn} \
    --pretrain_graph_mlp_adapter ${tuned_proj} \
    --tune_graph_mlp_adapter True \
    --graph_select_layer -2 \
    --use_graph_start_end True \
    --bf16 True \
    --output_dir ${output_model} \
    --num_train_epochs 2 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
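
Both stages use `--lr_scheduler_type "cosine"` with `--warmup_ratio 0.03`, with a peak learning rate of 2e-3 in stage 1 and 2e-5 in stage 2. The sketch below shows the shape of such a schedule, assuming the common linear-warmup-then-cosine-decay form. The function is illustrative and not taken from the repository, and the total of 10,000 steps is a made-up number.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float,
              warmup_ratio: float = 0.03) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from the peak down to 0.
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Print a few points of the stage-2 schedule (peak 2e-5) over 10,000 steps.
for step in (0, 150, 300, 5000, 10000):
    print(step, cosine_lr(step, total_steps=10000, peak_lr=2e-5))
```

With these settings the rate rises during the first 3% of steps, peaks at the configured `--learning_rate`, and then decays smoothly toward zero by the last step.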

4. Evaluating GraphGPT

4.1. Preparing Checkpoints and Data

4.2. Running Evaluation

You can run the evaluation by filling in the blanks in graphgpt_eval.sh. An example is shown below:

# Fill in the following paths to evaluate the model from the second tuning stage.
output_model=./checkpoints/stage_2
datapath=./data/eval/arxiv_nc.json
graph_data_path=./graph_data/all_graph_data.pt
res_path=./output_stage_2_arxiv_nc
start_id=0
end_id=20000
num_gpus=2

python3.8 ./graphgpt/eval/run_graphgpt.py --model-name ${output_model}  --prompting_file ${datapath} --graph_data_path ${graph_data_path} --output_res_path ${res_path} --start_id ${start_id} --end_id ${end_id} --num_gpus ${num_gpus}
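
The `start_id`, `end_id` and `num_gpus` settings suggest that the evaluation range is divided among workers. How `run_graphgpt.py` does this internally is not shown here. The sketch below is one plausible way to split a half-open index range into near-equal contiguous chunks, using a hypothetical function name.

```python
from typing import List, Tuple


def split_range(start_id: int, end_id: int, num_parts: int) -> List[Tuple[int, int]]:
    """Split [start_id, end_id) into `num_parts` contiguous half-open chunks."""
    if num_parts <= 0:
        raise ValueError("num_parts must be positive")
    total = max(0, end_id - start_id)
    # Every chunk gets `base` items; the first `extra` chunks get one more.
    base, extra = divmod(total, num_parts)
    chunks = []
    lo = start_id
    for i in range(num_parts):
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


# Values from the example above: 20000 prompts on 2 GPUs.
print(split_range(0, 20000, 2))  # [(0, 10000), (10000, 20000)]
```

A split like this lets an interrupted run be resumed by moving `start_id` forward without re-evaluating earlier prompts.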

Contact

For any questions or feedback, feel free to contact Jiabin Tang.

Misc

[![Stargazers repo roster for @HKUDS/GraphGPT](https://reporoster.com/stars/HKUDS/GraphGPT)](https://github.com/HKUDS/GraphGPT/stargazers) [![Forkers repo roster for @HKUDS/GraphGPT](https://reporoster.com/forks/HKUDS/GraphGPT)](https://github.com/HKUDS/GraphGPT/network/members) [![Star History Chart](https://api.star-history.com/svg?repos=HKUDS/GraphGPT&type=Date)](https://star-history.com/#HKUDS/GraphGPT&Date)

Citation

If you find GraphGPT useful in your research or applications, please kindly cite:

@article{tang2023graphgpt,
title={GraphGPT: Graph Instruction Tuning for Large Language Models}, 
author={Jiabin Tang and Yuhao Yang and Wei Wei and Lei Shi and Lixin Su and Suqi Cheng and Dawei Yin and Chao Huang},
year={2023},
eprint={2310.13023},
archivePrefix={arXiv},
primaryClass={cs.CL}
}

Acknowledgements

You may refer to the related work that serves as the foundation for our framework and code repository: Vicuna and LLaVa. We also partially draw inspiration from MiniGPT-4. For the text-graph grounding design, we leverage the implementation from G2P2. The design of our website and README.md was inspired by NExT-GPT. Thanks for their wonderful work.