This repo contains the source code and dataset for our NeurIPS 2021 paper:
Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation
2019 Conference on Neural Information Processing Systems (NeurIPS 2021)
paper
We have used the standard dataloader, i.e., ‘GEDDataset’, directly provided in the PyG.
AIDS700nef:
https://drive.google.com/uc?export=download&id=10czBPJDEzEDI2tq7Z7mkBjLhj55F-a2z
LINUX:
https://drive.google.com/uc?export=download&id=1nw0RRVgyLpit4V4XFQyDy0pI6wUEXSOI
ALKANE:
https://drive.google.com/uc?export=download&id=1-LmxaWW3KulLh00YqscVEflbqr0g4cXt
IMDBMulti:
https://drive.google.com/uc?export=download&id=12QxZ7EhYA7pJiF4cO-HuE8szhSOWcfST
The code takes pairs of graphs for training from an input folder where each pair of graph is stored as a JSON. Pairs of graphs used for testing are also stored as JSON files. Every node id and node label has to be indexed from 0. Keys of dictionaries are stored strings in order to make JSON serialization possible.
Every JSON file has the following key-value structure:
{"graph_1": [[0, 1], [1, 2], [2, 3], [3, 4]],
"graph_2": [[0, 1], [1, 2], [1, 3], [3, 4], [2, 4]],
"labels_1": [2, 2, 2, 2],
"labels_2": [2, 3, 2, 2, 2],
"ged": 1}
The **graph_1** and **graph_2** keys have edge list values which descibe the connectivity structure. Similarly, the **labels_1** and **labels_2** keys have labels for each node which are stored as list - positions in the list correspond to node identifiers. The **ged** key has an integer value which is the raw graph edit distance for the pair of graphs.
The codebase is implemented in Python 3.6.12. package versions used for development are just below.
matplotlib 3.3.4
networkx 2.4
numpy 1.19.5
pandas 1.1.2
scikit-learn 0.23.2
scipy 1.4.1
texttable 1.6.3
torch 1.6.0
torch-cluster 1.5.9
torch-geometric 1.7.0
torch-scatter 2.0.6
torch-sparse 0.6.9
tqdm 4.60.0
.
├── README.md
├── LICENSE
├── EGSC-T
│ ├── src
│ │ ├── egsc.py
│ │ ├── layers.py
│ │ ├── main.py
│ │ ├── model.py
│ │ ├── parser.py
│ │ └── utils.py
│ ├── README.md
│ └── train.sh
├── EGSC-KD
│ ├── src
│ │ ├── egsc_kd.py
│ │ ├── egsc_nonkd.py
│ │ ├── layers.py
│ │ ├── main_kd.py
│ │ ├── main_nonkd.py
│ │ ├── model_kd.py
│ │ ├── parser.py
│ │ ├── trans_modules.py
│ │ └── utils.py
│ ├── README.md
│ ├── train_kd.md
│ └── train_nonkd.sh
├── Checkpoints
│ ├── G_EarlyFusion_Disentangle_LINUX_gin_checkpoint.pth
│ ├── G_EarlyFusion_Disentangle_IMDBMulti_gin_checkpoint.pth
│ ├── G_EarlyFusion_Disentangle_ALKANE_gin_checkpoint.pth
│ └── G_EarlyFusion_Disentangle_AIDS700nef_gin_checkpoint.pth
└── GSC_datasets
├── AIDS700nef
├── ALKANE
├── IMDBMulti
└── LINUX
We would like to thank the SimGNN and Extended-SimGNN which we used for this implementation.
On some datasets, the results are not quite stable. We suggest to run multiple times to report the avarage one.