CurryTang / TSGFM

MIT License

Code and Datasets for Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights

This is the code repo accompanying our paper "Text-space Graph Foundation Models: Comprehensive Benchmarks and New Insights."

We implement the following graph foundation model building blocks.

We support the following two scenarios.


pip install -r requirements.txt


We follow OneForAll's way of managing the datasets. We support the following datasets.

| Name | #Graphs | #Nodes | #Edges | Domains | Tasks | #Classes |
| --- | --- | --- | --- | --- | --- | --- |
| Cora | 1 | 2708 | 10556 | CS Citation | Node, Link | 7 |
| CiteSeer | 1 | 3186 | 8450 | CS Citation | Node, Link | 6 |
| Arxiv | 1 | 169343 | 2315598 | CS Citation | Node, Link | 40 |
| Arxiv23 | 1 | 46198 | 77726 | CS Citation | Node, Link | 40 |
| History | 1 | 41551 | 503180 | E-commerce | Node, Link | 12 |
| Child | 1 | 76875 | 2325044 | E-commerce | Node, Link | 24 |
| Computers | 1 | 87229 | 1256548 | E-commerce | Node, Link | 10 |
| Photo | 1 | 48362 | 873782 | E-commerce | Node, Link | 12 |
| Sportsfit | 1 | 173055 | 3020134 | E-commerce | Node, Link | 13 |
| Products | 1 | 316513 | 19337722 | E-commerce | Node, Link | 39 |
| Amazon Ratings | 1 | 24492 | 186100 | E-commerce | Node, Link | 5 |
| Pubmed | 1 | 19717 | 88648 | Bio Citation | Node, Link | 3 |
| WikiCS | 1 | 11701 | 431726 | Knowledge | Node, Link | 10 |
| Tolokers | 1 | 11758 | 1038000 | Anomaly | Node, Link | 2 |
| DBLP | 1 | 14376 | 431326 | CS Citation | Node, Link | 4 |
| CheMBL | 365065 | 26 | 112 | Biology | Graph | 1048 |
| PCBA | 437092 | 26 | 56 | Biology | Graph | 128 |
| HIV | 41127 | 26 | 55 | Biology | Graph | 2 |
| Tox21 | 7831 | 19 | 39 | Biology | Graph | 12 |
| Bace | 1513 | 34 | 74 | Biology | Graph | 2 |
| Bbbp | 2039 | 24 | 52 | Biology | Graph | 2 |
| Muv | 93087 | 24 | 53 | Biology | Graph | 17 |
| Toxcast | 8575 | 19 | 39 | Biology | Graph | 588 |

The processed versions of the files can be obtained from the following link.

Structures of the processed files: is the core storage object, and node_text_feat stores the processed node features. contains the index file used to query the attributes stored in A comprehensive introduction of each column can be found in OneForAll's repo.

To prepare the data, you can generate all raw files yourself (run oneforall for 1 epoch with all datasets included). We recommend using the preprocessed files directly and unzipping them into the main directory.
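Unpacking the preprocessed archive into the main directory can be sketched as follows. The archive name `processed_data.zip` and the helper `unpack_preprocessed` are hypothetical and used only for illustration; substitute the file you actually downloaded.

```python
import zipfile
from pathlib import Path


def unpack_preprocessed(archive: Path, repo_root: Path) -> list[str]:
    """Extract a downloaded zip archive into the repository's main directory.

    Returns the names of the archive members that were extracted.
    """
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} is not a zip archive")
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()
        zf.extractall(repo_root)
    return names


if __name__ == "__main__":
    # Hypothetical file name; use the archive you actually downloaded.
    archive = Path("processed_data.zip")
    if archive.exists():
        extracted = unpack_preprocessed(archive, Path("."))
        print(f"extracted {len(extracted)} entries")
```

Checking `is_zipfile` first gives a clear error when the download is incomplete or is in a different archive format.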

Code Structures


Main entries

Reproduce the results



  1. Use to generate checkpoints
  2. Use or to generate the answer files for node/link-level tasks. For example, bash citeseer nc ./checkpoints/llaga-mistral-7b-hf-sbert-4-hop-token-linear-cora.3-citeseer.4-pubmed.3-nc-lp-projector/ citationcross
  3. Use to calculate the results


python3 --pre_train_datasets "cora-link" "citeseer-link" "pubmed-link" "arxiv-link" "arxiv23-link" "bookhis-link" "bookchild-link" "sportsfit-link" "products-link" "elecomp-link" "elephoto-link" --encoder gcn --num_layers 3 --num_hidden 128 --batch_size 512
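The dataset arguments in the command above are the dataset keys used elsewhere in this README with a `-link` suffix. A small helper can build that argument list; `to_link_names` is a hypothetical name, not a function from the repo.

```python
def to_link_names(dataset_keys):
    """Append the "-link" suffix seen in the link-level pre-training command."""
    return [f"{key}-link" for key in dataset_keys]


keys = ["cora", "citeseer", "pubmed", "arxiv", "arxiv23", "bookhis",
        "bookchild", "sportsfit", "products", "elecomp", "elephoto"]
link_names = to_link_names(keys)

# Print the names quoted and space-separated, ready to paste after
# --pre_train_datasets.
print(" ".join(f'"{name}"' for name in link_names))
```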


python3 --pre_train_datasets cora citeseer arxiv arxiv23 bookhis bookchild elecomp elephoto sportsfit products pubmed wikics --model BUDDY --cache_subgraph_features --max_hash_hops 3 --epochs 50
python3 --pre_train_datasets cora --model SEALGCN --hidden_channels 256 --num_hops 3
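SEAL-style models score a candidate link from the subgraph enclosing its two endpoints, and `--num_hops` presumably sets the radius of that subgraph. The following is a minimal pure-Python sketch of k-hop enclosing-subgraph extraction, not the repo's implementation; the function name and the adjacency-dict representation are assumptions made for illustration.

```python
from collections import deque


def k_hop_enclosing_subgraph(adj, src, dst, num_hops):
    """Return the nodes within `num_hops` of either endpoint and the edges among them.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    def bfs(start):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == num_hops:
                continue  # do not expand beyond the hop limit
            for nb in adj.get(node, ()):
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        return set(dist)

    nodes = bfs(src) | bfs(dst)
    # Keep each undirected edge once, as an ordered (smaller, larger) pair.
    edges = {(u, v) for u in nodes for v in adj.get(u, ()) if v in nodes and u < v}
    return nodes, edges


# Path graph 0-1-2-3-4-5 with candidate link (0, 5).
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
nodes, edges = k_hop_enclosing_subgraph(adj, 0, 5, num_hops=1)
print(sorted(nodes))  # [0, 1, 4, 5]
print(sorted(edges))  # [(0, 1), (4, 5)]
```

In a training pipeline the target link itself is usually removed from the extracted subgraph so the model cannot read the label from the input; that step is omitted here.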


See the paper for the best hyperparameters. Passing --cpuinf runs full-batch inference on the CPU, which is faster in our environment.

python3 --pre_train_datasets arxiv sportsfit products --method graphmae --num_heads 4 --num_out_heads 1 --num_layers 3 --num_hidden 1024 --residual --in_drop 0.5 --attn_drop 0.5 --norm 'batchnorm' --lr 0.01 --weight_decay 1e-5 --activation 'prelu' --mask_rate 0.75 --drop_edge_rate 0 --replace_rate 0.2 --scheduler --lrtype 'cosine' --save_model --max_epoch 5 --subgraph_size 1024 --warmup --cpuinf
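The `--mask_rate 0.75` and `--replace_rate 0.2` flags correspond to GraphMAE's masked-feature scheme: a fraction of nodes is selected for masking, and part of that selection receives another node's features instead of the mask token. The sketch below illustrates only that split, in plain Python. It is a simplified illustration under those assumptions rather than the code used in this repo, and `split_masked_nodes` is a hypothetical name.

```python
import random


def split_masked_nodes(num_nodes, mask_rate, replace_rate, seed=0):
    """Choose which nodes are masked and how each masked node is corrupted.

    Returns (token_nodes, noise_nodes, keep_nodes):
      token_nodes: features would be replaced by a learnable mask token
      noise_nodes: features would be replaced by those of a random node
      keep_nodes:  features left untouched
    """
    rng = random.Random(seed)
    perm = list(range(num_nodes))
    rng.shuffle(perm)

    num_mask = int(mask_rate * num_nodes)
    mask_nodes, keep_nodes = perm[:num_mask], perm[num_mask:]

    # mask_nodes is already in random order, so a prefix is a random subset.
    num_noise = int(replace_rate * num_mask)
    noise_nodes = mask_nodes[:num_noise]
    token_nodes = mask_nodes[num_noise:]
    return token_nodes, noise_nodes, keep_nodes


token, noise, keep = split_masked_nodes(1000, mask_rate=0.75, replace_rate=0.2)
print(len(token), len(noise), len(keep))  # 600 150 250
```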


Pretrain on Arxiv:

python experiments/ --dataset arxiv --root <root> --original_features False -ds_cap 24000 -val_cap 100 -test_cap 100 --emb_dim 256 --epochs 1 -ckpt_step 1000 -layers S2,U,M -lr 3e-4 -way 30 -shot 3 -qry 4 -eval_step 5000 -task cls_nm_sb -bs 1 -aug ND0.5,NZ0.5 -aug_test True -attr 1000 --device 0 --prefix MAG_PT_PRODIGY

Test on History:

python3 experiments/ --dataset bookhis --original_features True -ds_cap 300 -val_cap 300 -test_cap 300 --emb_dim 256 --epochs 1 -ckpt_step 1000 -layers S2,U,M -lr 3e-4 -way 12 -shot 3 -qry 4 -eval_step 50 -task cls_nm_sb  -bs 1 -aug ND0.5,NZ0.5 -aug_test True -attr 1000 --device 0 --prefix test --root <root> -pretrained <ckpt>
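The `-way`, `-shot`, and `-qry` flags read like the usual few-shot episode parameters (classes per episode, support examples per class, query examples per class). That interpretation is an assumption, as is every name in the sketch below, which shows how one such episode could be sampled from labeled nodes.

```python
import random
from collections import defaultdict


def sample_episode(labels, way, shot, qry, seed=0):
    """Sample one N-way K-shot episode.

    `labels` maps node id -> class id. Returns (support, query), each a list
    of (node_id, class_id) pairs with `shot` / `qry` nodes per sampled class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for node, cls in labels.items():
        by_class[cls].append(node)

    # Only classes with enough nodes for both support and query are eligible.
    eligible = [c for c, nodes in by_class.items() if len(nodes) >= shot + qry]
    if len(eligible) < way:
        raise ValueError("not enough classes with shot + qry labeled nodes")

    support, query = [], []
    for cls in rng.sample(eligible, way):
        picked = rng.sample(by_class[cls], shot + qry)
        support += [(n, cls) for n in picked[:shot]]
        query += [(n, cls) for n in picked[shot:]]
    return support, query


# Toy labels: 12 classes (the class count listed for History), 10 nodes each.
labels = {node: node % 12 for node in range(120)}
support, query = sample_episode(labels, way=12, shot=3, qry=4)
print(len(support), len(query))  # 36 48
```

The test command above uses `-way 12`, which matches the 12 classes listed for History, so each episode in that setting would cover every class.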


This code repo is heavily based on OneForAll (✨), BUDDY, LLaGA, GraphMAE, Prodigy, and CSTAG. Thanks to their authors for sharing their code!