Efficient Graph Similarity Computation - (EGSC)

This repo contains the source code and dataset for our NeurIPS 2021 paper:

Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation
2019 Conference on Neural Information Processing Systems (NeurIPS 2021)
paper

@inproceedings{qin2021slow,
              title={Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation},
              author={Qin, Can and Zhao, Handong and Wang, Lichen and Wang, Huan and Zhang, Yulun and Fu, Yun},
              booktitle={Thirty-Fifth Conference on Neural Information Processing Systems},
              year={2021}
            }

EGSC

Introduction

Graph Similarity Computation (GSC) is essential to wide-ranging graph appli- cations such as retrieval, plagiarism/anomaly detection, etc. The exact computation of graph similarity, e.g., Graph Edit Distance (GED), is an NP-hard problem that cannot be exactly solved within an adequate time given large graphs. Thanks to the strong representation power of graph neural network (GNN), a variety of GNN-based inexact methods emerged. To capture the subtle difference across graphs, the key success is designing the dense interaction with features fusion at the early stage, which, however, is a trade-off between speed and accuracy. For Slow Learning of graph similarity, this paper proposes a novel early-fusion approach by designing a co-attention-based feature fusion network on multilevel GNN features. To further improve the speed without much accuracy drop, we introduce an efficient GSC solution by distilling the knowledge from the slow early-fusion model to the student one for Fast Inference. Such a student model also enables the offline collection of individual graph embeddings, speeding up the inference time in orders. To address the instability through knowledge transfer, we decompose the dynamic joint embedding into the static pseudo individual ones for precise teacher-student alignment. The experimental analysis on the real-world datasets demonstrates the superiority of our approach over the state-of-the-art methods on both accuracy and efficiency. Particularly, we speed up the prior art by more than 10x on the benchmark AIDS data.

Dataset

We have used the standard dataloader, i.e., ‘GEDDataset’, directly provided in the PyG.

AIDS700nef: https://drive.google.com/uc?export=download&id=10czBPJDEzEDI2tq7Z7mkBjLhj55F-a2z

LINUX: https://drive.google.com/uc?export=download&id=1nw0RRVgyLpit4V4XFQyDy0pI6wUEXSOI

ALKANE: https://drive.google.com/uc?export=download&id=1-LmxaWW3KulLh00YqscVEflbqr0g4cXt

IMDBMulti: https://drive.google.com/uc?export=download&id=12QxZ7EhYA7pJiF4cO-HuE8szhSOWcfST

The code takes pairs of graphs for training from an input folder where each pair of graph is stored as a JSON. Pairs of graphs used for testing are also stored as JSON files. Every node id and node label has to be indexed from 0. Keys of dictionaries are stored strings in order to make JSON serialization possible.

Every JSON file has the following key-value structure:

{"graph_1": [[0, 1], [1, 2], [2, 3], [3, 4]],
 "graph_2":  [[0, 1], [1, 2], [1, 3], [3, 4], [2, 4]],
 "labels_1": [2, 2, 2, 2],
 "labels_2": [2, 3, 2, 2, 2],
 "ged": 1}

The **graph_1** and **graph_2** keys have edge list values which descibe the connectivity structure. Similarly, the **labels_1** and **labels_2** keys have labels for each node which are stored as list - positions in the list correspond to node identifiers. The **ged** key has an integer value which is the raw graph edit distance for the pair of graphs.

Requirements

The codebase is implemented in Python 3.6.12. package versions used for development are just below.

matplotlib        3.3.4
networkx          2.4
numpy             1.19.5
pandas            1.1.2
scikit-learn      0.23.2
scipy             1.4.1
texttable         1.6.3
torch             1.6.0
torch-cluster     1.5.9
torch-geometric   1.7.0
torch-scatter     2.0.6
torch-sparse      0.6.9
tqdm              4.60.0

File Structure

.
├── README.md
├── LICENSE                            
├── EGSC-T
│   ├── src
│   │    ├── egsc.py 
│   │    ├── layers.py
│   │    ├── main.py
│   │    ├── model.py
│   │    ├── parser.py        
│   │    └── utils.py                             
│   ├── README.md                      
│   └── train.sh
├── EGSC-KD
│   ├── src
│   │    ├── egsc_kd.py 
│   │    ├── egsc_nonkd.py 
│   │    ├── layers.py
│   │    ├── main_kd.py
│   │    ├── main_nonkd.py
│   │    ├── model_kd.py
│   │    ├── parser.py    
│   │    ├── trans_modules.py    
│   │    └── utils.py                             
│   ├── README.md  
│   ├── train_kd.md                     
│   └── train_nonkd.sh 
├── Checkpoints
│   ├── G_EarlyFusion_Disentangle_LINUX_gin_checkpoint.pth
│   ├── G_EarlyFusion_Disentangle_IMDBMulti_gin_checkpoint.pth
│   ├── G_EarlyFusion_Disentangle_ALKANE_gin_checkpoint.pth
│   └── G_EarlyFusion_Disentangle_AIDS700nef_gin_checkpoint.pth                         
└── GSC_datasets
    ├── AIDS700nef
    ├── ALKANE
    ├── IMDBMulti
    └── LINUX

To Do

[x] GED Datasets Processing
[x] Teacher Model Training
[x] Student Model Training
[x] Knowledge Distillation
[ ] Online Inference

Acknowledgement

We would like to thank the SimGNN and Extended-SimGNN which we used for this implementation.

Hint

On some datasets, the results are not quite stable. We suggest to run multiple times to report the avarage one.

canqin001 / Efficient_Graph_Similarity_Computation

readme