This is a PyTorch implementation of the paper *Exploring Correlations of Self-Supervised Tasks for Graphs*, accepted at ICML 2024. We quantitatively characterize the correlations between different graph self-supervised tasks and obtain more effective graph self-supervised representations with our proposed GraphTCM.
We used the following packages under Python 3.10:

- pytorch 2.1.1
- torch-geometric 2.4.0
- matplotlib 3.5.0
- pandas 2.1.3
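If you are setting up the environment from scratch, an installation along these lines should work; the exact PyPI package names and any CUDA-specific wheels (e.g., optional torch-geometric companion packages) depend on your system, so treat this as a sketch rather than a verified install command:

```sh
pip install torch==2.1.1 torch_geometric==2.4.0 matplotlib==3.5.0 pandas==2.1.3
```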
Existing graph self-supervised methods can be grouped into four primary categories: feature-based (FB), structure-based (SB), auxiliary property-based (APB), and contrast-based (CB). To comprehensively understand the complex relationships among graph self-supervised tasks, we chose two representative methods from each category for detailed analysis.
We provide the representations obtained by training with these eight self-supervised methods across various datasets, located in the `emb/` directory.
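As a quick sanity check, the stored representations can be loaded with `pickle`; the dataset and method names in the path below are hypothetical, so adjust them to whatever actually appears under `emb/`:

```python
import pickle

# Hypothetical path: the actual layout and file names under emb/ may differ.
with open("emb/Cora/GraphCL.pkl", "rb") as f:
    emb = pickle.load(f)

# Typically a tensor or array of shape [num_nodes, hidden_dim].
print(type(emb), getattr(emb, "shape", None))
```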
Given two self-supervised tasks $t_1, t_2 \in \mathcal{T}$ and a graph $\mathcal{G}: (\mathbf{A}, \mathbf{X})$, we define a correlation value $\text{Cor}(t_1, t_2)$ between the two tasks; please refer to the paper for the formal definition.
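For intuition only, here is a minimal sketch of one generic way to compare two tasks' representation matrices, linear CKA. This is a stand-in illustration and is **not** the paper's $\text{Cor}(t_1, t_2)$ definition:

```python
import torch

def linear_cka(h1: torch.Tensor, h2: torch.Tensor) -> float:
    """Linear CKA similarity between two representation matrices.

    h1, h2: [num_nodes, dim] embeddings produced by two self-supervised
    tasks. NOTE: a generic similarity measure for illustration only; it
    is NOT the Cor(t1, t2) defined in the paper.
    """
    # Center each representation matrix feature-wise.
    h1 = h1 - h1.mean(dim=0, keepdim=True)
    h2 = h2 - h2.mean(dim=0, keepdim=True)
    # ||h1^T h2||_F^2 / (||h1^T h1||_F * ||h2^T h2||_F)
    cross = torch.linalg.norm(h1.T @ h2) ** 2
    norm1 = torch.linalg.norm(h1.T @ h1)
    norm2 = torch.linalg.norm(h2.T @ h2)
    return (cross / (norm1 * norm2)).item()

# Example: sim = linear_cka(emb_t1, emb_t2)  # 1.0 = identical up to rotation/scale
```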
We provide the correlation values for various self-supervised tasks across different datasets in `train_GraphTCM.py`.
Please run `train_GraphTCM.py` to train a GraphTCM model on a specific dataset.
```
usage: train_GraphTCM.py [-h] [--hidden_dim HIDDEN_DIM] [--pooling POOLING] [--device_num DEVICE_NUM] [--epoch_num EPOCH_NUM] [--lr LR] [--seed SEED] [--valid_rate VALID_RATE] [--dataset DATASET]

PyTorch implementation for building the correlation.

options:
  -h, --help            show this help message and exit
  --hidden_dim HIDDEN_DIM
                        hidden dimension
  --pooling POOLING     pooling type
  --device_num DEVICE_NUM
                        device number
  --epoch_num EPOCH_NUM
                        epoch number
  --lr LR               learning rate
  --seed SEED           random seed
  --valid_rate VALID_RATE
                        validation rate
  --dataset DATASET     dataset
```
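For example, a hypothetical invocation (the dataset name and hyper-parameter values below are illustrative, not the settings used in the paper):

```sh
python train_GraphTCM.py --dataset Cora --hidden_dim 256 --lr 1e-3 --epoch_num 100 --seed 42
```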
After training a GraphTCM model, please run `train_emb.py` to obtain more effective self-supervised representations. To facilitate further experiments, we also provide the trained representations based on GraphTCM in the `emb/` directory, each named `GraphTCM.pkl`.
```
usage: train_emb.py [-h] [--hidden_dim HIDDEN_DIM] [--device_num DEVICE_NUM] [--epoch_num EPOCH_NUM] [--lr LR] [--seed SEED] [--dataset DATASET] [--path PATH] [--target TARGET] [--train_method TRAIN_METHOD]

PyTorch implementation for training the representations.

options:
  -h, --help            show this help message and exit
  --hidden_dim HIDDEN_DIM
                        hidden dimension
  --device_num DEVICE_NUM
                        device number
  --epoch_num EPOCH_NUM
                        epoch number
  --lr LR               learning rate
  --seed SEED           random seed
  --dataset DATASET     dataset
  --path PATH           path for the trained GraphTCM model
  --target TARGET       training target (ones or zeros)
  --train_method TRAIN_METHOD
                        training method
```
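For example, a hypothetical invocation (the dataset name is illustrative, and the model path is a placeholder for wherever your trained GraphTCM checkpoint was saved; `--train_method` is omitted here since its accepted values are not listed above):

```sh
python train_emb.py --dataset Cora --path <path-to-trained-GraphTCM-model> --target ones
```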
We have provided scripts with hyper-parameter settings to reproduce the experimental results presented in our paper. Please run `run.sh` under `downstream/` to obtain the downstream results across various datasets.
```sh
cd downstream/
sh run.sh
```
You can cite our paper with the following BibTeX entry:
```bibtex
@inproceedings{Fang2024ExploringCO,
  title     = {Exploring Correlations of Self-supervised Tasks for Graphs},
  author    = {Taoran Fang and Wei Zhou and Yifei Sun and Kaiqiao Han and Lvbin Ma and Yang Yang},
  booktitle = {International Conference on Machine Learning},
  year      = {2024}
}
```