Marigoldwu / A-Unified-Framework-for-Deep-Attribute-Graph-Clustering

This project is a scalable unified framework for deep graph clustering.
https://www.marigold.website/readArticle?workId=145&author=Marigold&authorId=1000001
MIT License

🚀 A-Unified-Framework-for-Deep-Attribute-Graph-Clustering

☞ See the Chinese version in [Marigold]

Deep attribute graph clustering has developed rapidly in recent years, and many methods have been proposed. Although most of them are open source, their code does not share a unified framework, so researchers have to spend a lot of time modifying code before they can reproduce results. Fortunately, Liu et al. [Homepage: yueliu1999] organized deep graph clustering methods into a code repository, Awesome-Deep-Graph-Clustering (ADGC). For example, they provide more than 20 datasets in a unified format. They also list the most relevant papers on deep graph clustering and link to the source code. Notably, they organize deep graph clustering code into a rand-augmentation-model-clustering-visualization-utils structure, which greatly helps beginners and researchers. I would like to express my sincere thanks and respect to Liu et al.

❤️ Acknowledgements:

Thanks to the following authors for open-sourcing their code (listed in no particular order):

[ yueliu1999 | bdy9527 | Liam Liu | Zhihao PENG | William Zhu | WxTu ]

[ xihongyang1999 | gongleii ]


🍉 Introduction

Building on ADGC, I refactored the code to unify the deep clustering implementations further. Specifically, I redesigned the code architecture so that the open-source models can be run easily, and I defined utility classes and functions that simplify the code and make the configuration of settings clear.

🍓 Quick Start

After cloning the repository, follow the steps below to run the code:

✈️ Step 1: Check your environment, or install the required libraries directly from requirements.txt.

pip install -r requirements.txt

✈️ Step 2: Prepare the datasets. If you don't have the datasets, you can download them from Liu's repository [yueliu1999 | Google Drive | Nutstore]. Then unzip them into the dataset directory.

✈️ Step 3: From the command line, run main.py in the directory where it is located. If you are using an IDE, you can run the main.py file directly.

:star: Examples

Example 1

Take training DAEGC as an example:

:one: pretrain GAT:

python main.py --pretrain --model pretrain_gat_for_daegc --dataset acm --t 2 --desc pretrain_the_GAT_for_DAEGC_on_acm
# or the simplified command:
python main.py -P -M pretrain_gat_for_daegc -D acm -T 2 -DS pretrain_the_GAT_for_DAEGC_on_acm

:two: train DAEGC:

python main.py --model DAEGC --dataset acm --t 2 --desc Train_DAEGC_1_iteration_on_the_ACM_dataset
# or the simplified command:
python main.py -M DAEGC -D acm -T 2 -DS Train_DAEGC_1_iteration_on_the_ACM_dataset

Example 2

Take training SDCN as an example:

:one: pretrain AE:

python main.py --pretrain --model pretrain_ae_for_sdcn --dataset acm --desc pretrain_ae_for_SDCN_on_acm
# or simplified command:
python main.py -P -M pretrain_ae_for_sdcn -D acm -DS pretrain_ae_for_SDCN_on_acm

:two: train SDCN:

python main.py --model SDCN --dataset acm --norm --desc Train_SDCN_1_iteration_on_the_ACM_dataset
# or simplified command:
python main.py -M SDCN -D acm -N -DS Train_SDCN_1_iteration_on_the_ACM_dataset
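
If you prefer to launch the pretraining and training steps from a script, the commands above can be assembled programmatically. The following is a minimal sketch, not part of the framework: the helper names build_command and run_pipeline are hypothetical, and only flags that appear in the examples above are used.

```python
import subprocess
import sys


def build_command(model, dataset, pretrain=False, t=None, norm=False, desc="default"):
    """Assemble the argument list for one run of main.py (hypothetical helper)."""
    cmd = [sys.executable, "main.py"]
    if pretrain:
        cmd.append("--pretrain")
    cmd += ["--model", model, "--dataset", dataset]
    if t is not None:
        cmd += ["--t", str(t)]
    if norm:
        cmd.append("--norm")
    cmd += ["--desc", desc]
    return cmd


def run_pipeline(commands):
    """Run the commands in order and stop at the first failing step."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example 2 from above: pretrain the AE first, then train SDCN on ACM.
sdcn_pipeline = [
    build_command("pretrain_ae_for_sdcn", "acm", pretrain=True,
                  desc="pretrain_ae_for_SDCN_on_acm"),
    build_command("SDCN", "acm", norm=True,
                  desc="Train_SDCN_1_iteration_on_the_ACM_dataset"),
]

# Show the commands without executing them.
for cmd in sdcn_pipeline:
    print(" ".join(cmd[1:]))
```

Calling run_pipeline(sdcn_pipeline) from the directory that contains main.py would then execute both steps in sequence; check=True makes the training step wait for a successful pretraining run.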

✈️ Step 4: If you run the code successfully, don't forget to give me a star! :wink:

🔓 Currently Supported Models

| No. | Model | Paper | Analysis | Source Code |
| --- | --- | --- | --- | --- |
| 1 | DAEGC | 《Attributed Graph Clustering: A Deep Attentional Embedding Approach》 | Paper Reading 02 | link |
| 2 | SDCN | 《Structural Deep Clustering Network》 | Paper Reading 03 | link |
| 3 | AGCN | 《Attention-driven Graph Clustering Network》 | Paper Reading 04 | link |
| 4 | EFR-DGC | 《Deep Graph clustering with enhanced feature representations for community detection》 | Paper Reading 12 | link |
| 5 | GCAE | :exclamation: In fact, it's GAE with GCN. | - | - |
| 6 | DFCN | 《Deep Fusion Clustering Network》 | Paper Reading 09 | link |
| 7 | HSAN | 《Hard Sample Aware Network for Contrastive Deep Graph Clustering》 | - | link |
| 8 | DCRN | 《Deep Graph Clustering via Dual Correlation Reduction》 | - | link |
| 9 | CCGC | 《Cluster-guided Contrastive Graph Clustering Network》 | - | link |
| 10 | AGC-DRR | 《Attributed Graph Clustering with Dual Redundancy Reduction》 | - | link |

:exclamation: Attention

  1. According to the paper, the training process of DFCN is divided into three stages. First, pretrain pretrain_ae_for_dfcn and pretrain_igae_for_dfcn separately for 30 epochs. Second, pretrain the AE and IGAE simultaneously for 100 epochs; both are integrated into pretrain_both_for_dfcn. Finally, train DFCN itself for at least 200 epochs. The same applies to DCRN.
  2. The HSAN model does not require pretraining.
  3. The results reported in the DCRN paper have not yet been reproduced; the implementation will continue to be updated.

In the future, I plan to update the other models. If you find my framework useful, feel free to contribute to its improvement by submitting your own code.

🔓 TODO

| No. | Model | Paper | Analysis | Source Code |
| --- | --- | --- | --- | --- |
| 1 | SCGC | 《Simple Contrastive Graph Clustering》 | - | link |
| 2 | Dink-Net | 《Dink-Net: Neural Clustering on Large Graphs》 | - | link |

:robot: Commands

:alien: DAEGC

# pretrain
python main.py -P -M pretrain_gat_for_daegc -D acm -T 2 -DS balabala -LS 1
# train
python main.py -M DAEGC -D acm -T 2 -DS balabala -LS 1 -TS -H

:alien: SDCN

# pretrain
python main.py -P -M pretrain_ae_for_sdcn -D acm -DS balabala -LS 1
# train
python main.py -M SDCN -D acm -N -DS balabala -LS 1 -TS -H

:alien: AGCN

# pretrain
python main.py -P -M pretrain_ae_for_agcn -D acm -DS balabala -LS 1
# train
python main.py -M AGCN -D acm -N -SF -DS balabala -LS 1 -TS -H

:alien: EFR-DGC

# pretrain
python main.py -P -M pretrain_ae_for_efrdgc -D acm -DS balabala -LS 1
python main.py -P -M pretrain_gat_for_efrdgc -D acm -T 2 -DS balabala -LS 1
# train
python main.py -M EFRDGC -D acm -T 2 -DS balabala -LS 1 -TS -H

:alien: GCAE

# pretrain
python main.py -P -M pretrain_gae_for_gcae -D acm -N -DS balabala -LS 1
# train
python main.py -M GCAE -D acm -N -DS balabala -LS 1 -TS -H

:alien: ​DFCN

# pretrain. Execute the following commands in sequence.
python main.py -P -M pretrain_ae_for_dfcn -D acm -DS balabala -LS 1
python main.py -P -M pretrain_igae_for_dfcn -D acm -N -DS balabala -LS 1
python main.py -P -M pretrain_both_for_dfcn -D acm -N -DS balabala -LS 1
# train
python main.py -M DFCN -D acm -N -DS balabala -LS 1 -TS -H

:alien: HSAN

# train
python main.py -M HSAN -D cora -SLF -A npy -F npy -DS balabala -LS 1 -TS

:alien: DCRN

# pretrain. Execute the following commands in sequence.
python main.py -P -M pretrain_ae_for_dcrn -D acm -S 1 -DS balabala -LS 1
python main.py -P -M pretrain_igae_for_dcrn -D acm -N -SF -S 1 -DS balabala -LS 1
python main.py -P -M pretrain_both_for_dcrn -D acm -N -SF -S 1 -DS balabala -LS 1
# train
python main.py -M DCRN -D acm -SLF -A npy -S 3 -DS balabala -LS 1 -TS -H

:alien: CCGC

python main.py -M CCGC -D acm -SLF -SF -A npy -S 0 -LS 1 -DS balabala

:alien: AGC-DRR

python main.py -M AGCDRR -D acm -F npy -S 0 -LS 1 -DS balabala

🍊 Advanced

:exclamation: Arguments

🥤 Help

> python main.py --help
usage: main.py [-h] [-P] [-TS] [-H] [-N] [-SLF] [-SF] [-DS DESC]
               [-M MODEL_NAME] [-D DATASET_NAME] [-R ROOT] [-K K] [-T T]
               [-LS LOOPS] [-F {tensor,npy}] [-L {tensor,npy}]
               [-A {tensor,npy}] [-S SEED]

Scalable Unified Framework of Deep Graph Clustering

optional arguments:
  -h, --help            show this help message and exit
  -P, --pretrain        Whether to pretrain. Using '-P' to pretrain.
  -TS, --tsne           Whether to draw the clustering tsne image. Using '-TS'
                        to draw clustering TSNE.
  -H, --heatmap         Whether to draw the embedding heatmap. Using '-H' to
                        draw embedding heatmap.
  -N, --norm            Whether to normalize the adj, default is False. Using
                        '-N' to load adj with normalization.
  -SLF, --self_loop_false
                        Whether the adj has self-loop, default is True. Using
                        '-SLF' to load adj without self-loop.
  -SF, --symmetric_false
                        Whether the normalization type is symmetric. Using
                        '-SF' to load asymmetric adj.
  -DS DESC, --desc DESC
                        The description of this experiment.
  -M MODEL_NAME, --model MODEL_NAME
                        The model you want to run.
  -D DATASET_NAME, --dataset DATASET_NAME
                        The dataset you want to use.
  -R ROOT, --root ROOT  Input root path to switch relative path to absolute.
  -K K, --k K           The k of KNN.
  -T T, --t T           The order in GAT. 'None' denotes don't calculate the
                        matrix M.
  -LS LOOPS, --loops LOOPS
                        The Number of training rounds.
  -F {tensor,npy}, --feature {tensor,npy}
                        The datatype of feature. 'tenor' and 'npy' are
                        available.
  -L {tensor,npy}, --label {tensor,npy}
                        The datatype of label. 'tenor' and 'npy' are
                        available.
  -A {tensor,npy}, --adj {tensor,npy}
                        The datatype of adj. 'tenor' and 'npy' are available.
  -S SEED, --seed SEED  The random seed. The default value is 0.
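
To make the help text above easier to read, here is a minimal argparse sketch reconstructed from it. It is an illustration rather than the framework's actual code: the option strings, choices, and defaults are taken from the help output and the details table in this README, while the dest names self_loop and symmetric are assumptions introduced here.

```python
import argparse


def build_parser():
    """Rebuild a parser with the same options as the help text (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Scalable Unified Framework of Deep Graph Clustering")
    # Plain switches: False unless the flag is given.
    parser.add_argument("-P", "--pretrain", action="store_true")
    parser.add_argument("-TS", "--tsne", action="store_true")
    parser.add_argument("-H", "--heatmap", action="store_true")
    parser.add_argument("-N", "--norm", action="store_true")
    # Inverted switches: True by default, the flag turns them off.
    # The dest names below are assumptions, not taken from the source.
    parser.add_argument("-SLF", "--self_loop_false", dest="self_loop",
                        action="store_false")
    parser.add_argument("-SF", "--symmetric_false", dest="symmetric",
                        action="store_false")
    parser.add_argument("-DS", "--desc", default="default")
    parser.add_argument("-M", "--model", dest="model_name", default="SDCN")
    parser.add_argument("-D", "--dataset", dest="dataset_name", default="acm")
    parser.add_argument("-R", "--root", default=None)
    parser.add_argument("-K", "--k", type=int, default=None)
    parser.add_argument("-T", "--t", type=int, default=None)
    parser.add_argument("-LS", "--loops", type=int, default=1)
    # The three datatype options share the same two choices.
    for short, name, default in (("-F", "feature", "tensor"),
                                 ("-L", "label", "npy"),
                                 ("-A", "adj", "tensor")):
        parser.add_argument(short, "--" + name, choices=["tensor", "npy"],
                            default=default)
    parser.add_argument("-S", "--seed", type=int, default=0)
    return parser


args = build_parser().parse_args(["-M", "DAEGC", "-D", "acm", "-T", "2", "-SLF"])
print(args.model_name, args.t, args.self_loop, args.symmetric)
```

The "store_false" actions explain why -SLF and -SF read as negations: the underlying value is True by default, and passing the flag switches it off.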

🍹 Details

Here are the details of argparse arguments you can change:

| Tag | Argument | Short | Description | Type/Action | Default |
| --- | --- | --- | --- | --- | --- |
| 🟥 | --pretrain | -P | Whether this training is pretraining. | "store_true" | False |
| 🟩 | --tsne | -TS | Use it if you want to draw the clustering result as a scatter plot. | "store_true" | False |
| 🟩 | --heatmap | -H | Use it if you want to draw the heatmap of the embedding representation learned by the model. | "store_true" | False |
| 🟥 | --norm | -N | Whether to normalize the adj, default is False. Use '-N' to load adj with normalization. | "store_true" | False |
| 🟦 | --self_loop_false | -SLF | Whether the adj has self-loops, default is True. Use '-SLF' to load adj without self-loops. | "store_false" | True |
| 🟦 | --symmetric_false | -SF | Whether the normalization type is symmetric. Use '-SF' to load asymmetric adj. | "store_false" | True |
| 🟥 | --model | -M | The model you want to train. Should correspond to a model in the model directory. | str | "SDCN" |
| 🟥 | --dataset | -D | The dataset you want to train on. Should correspond to a dataset name in the dataset directory. | str | "acm" |
| 🟦 | --k | -K | For graph datasets, it is set to None. If the dataset is not of graph type, you should set k to construct a KNN graph of the dataset. | int | None |
| 🟦 | --t | -T | If the model needs the matrix M, such as DAEGC, you should set t according to the paper. None denotes that the model does not need M. | int | None |
| 🟥 | --loops | -LS | The number of training runs. If you want to train the model 10 times, set it to 10. | int | 1 |
| 🟥 | --root | -R | If you need to change the relative path to an absolute path, set it to the root path. | str | None |
| 🟪 | --desc | -DS | The description of this experiment. | str | "default" |
| 🟦 | --feature | -F | The datatype of feature. 'tensor' and 'npy' are available. | str | "tensor" |
| 🟦 | --label | -L | The datatype of label. 'tensor' and 'npy' are available. | str | "npy" |
| 🟦 | --adj | -A | The datatype of adj. 'tensor' and 'npy' are available. | str | "tensor" |
| 🟥 | --seed | -S | The random seed. It is 0 if not specified. | int | 0 |
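
For readers unfamiliar with the matrix M mentioned for --t: in the DAEGC paper it is the t-order proximity matrix built from the transition matrix B. The formula below is a reference sketch that assumes the framework follows the paper's definition; the symbols B, d_i, and E are notation from the paper, not identifiers from this code base.

```math
M = \frac{1}{t}\left(B + B^{2} + \cdots + B^{t}\right),
\qquad
B_{ij} =
\begin{cases}
1/d_i & \text{if } (i, j) \in E \\
0 & \text{otherwise}
\end{cases}
```

Here d_i is the degree of node i and E is the edge set, so "-T 2" corresponds to averaging the one-step and two-step transition matrices.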

💡 Tips:

  • The arguments marked with 🟥 usually need to be specified.
  • The arguments marked with 🟩 control the drawing functions.
  • The arguments marked with 🟦 are related to data loading.
  • The argument marked with 🟪 is strongly recommended for recording the key points of an experiment.
  • Note that "--norm" is used with graph convolutional networks to obtain a symmetrically normalized adjacency matrix; graph attention networks do not require it. If a model uses both, it is recommended to load the adjacency matrix without symmetric normalization first and then normalize it manually.
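
For reference, the two normalization types that "--norm" and "-SF" choose between are commonly written as follows. This is a sketch of the standard GCN-style formulas, assuming the framework follows them; A, I, and D are notation introduced here and are not names from the code.

```math
\begin{aligned}
\tilde{A} &= A + I, \qquad \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij} \\
\hat{A}_{\mathrm{sym}} &= \tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2} \\
\hat{A}_{\mathrm{asym}} &= \tilde{D}^{-1}\,\tilde{A}
\end{aligned}
```

Under the same assumption, passing "-SLF" would drop the self-loop term I, so that the normalization is applied to A itself.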

🧩 Scalability

Strong scalability is a prominent feature of this framework. If you want to run your own code in this framework, follow the steps below:

🐯 Model Extension

🚄 Step 1: Write a model file model.py using PyTorch and a training function file train.py, and put them into a directory named after the model name in uppercase. Then put that directory into the model directory. We provide template files in the template directory.

🚄 Step 2: If your model needs to be pretrained, write a pretraining file train.py and put it into a directory named pretrain_{module (lowercase)}_for_{model (lowercase)}, then put that directory into the model directory. We provide template files in the template directory.

🚄 Step 3: Modify the pretrain_type_dict in line 38 of path_manager.py. The format is "model name (uppercase)": [items]. If your model does not need to be pretrained, leave the list empty. Otherwise, list all modules you need to pretrain. For example, if you want to pretrain the AE module, add "pretrain_ae" to the list. Also check whether the pretrain type exists in the if-else statement; if not, please add it manually.

🚄 Step 4: Run your code!
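
The naming rules from Steps 1 to 3 can be sketched in a few lines. The dictionary below is a hypothetical excerpt in the "model name (uppercase)": [items] format; its entries are inferred from the commands listed in this README and may differ from the actual contents of path_manager.py. The helper pretrain_dirs is likewise an illustration, not a function of the framework.

```python
# Hypothetical excerpt: model name (uppercase) -> modules to pretrain.
pretrain_type_dict = {
    "SDCN": ["pretrain_ae"],
    "DAEGC": ["pretrain_gat"],
    "DFCN": ["pretrain_ae", "pretrain_igae", "pretrain_both"],
    "HSAN": [],  # an empty list means no pretraining is needed
}


def pretrain_dirs(model_name):
    """Return the pretraining directory names implied by the rule
    pretrain_{module}_for_{model}, in the order they are listed."""
    items = pretrain_type_dict[model_name.upper()]
    return [item + "_for_" + model_name.lower() for item in items]


for model in pretrain_type_dict:
    print(model, pretrain_dirs(model))
```

Each returned name matches a value that can be passed to -M together with -P, for example pretrain_ae_for_sdcn.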

🐴 Dataset Extension

🚌 Step 1: Make sure that your dataset is well processed and that the files use the 'npy' suffix, which indicates that they store NumPy arrays. If your dataset is graph data, you need to include {dataset name}_feat.npy, {dataset name}_label.npy, and {dataset name}_adj.npy. If your dataset is non-graph data, there are two ways to handle it. One is to use {dataset name}_feat.npy and {dataset name}_label.npy directly and set the graph construction type in line 167 of load_data.py. If the construction type does not exist, please add it to the function construct_graph in data_processor.py. The other is to construct the graph data manually and use {dataset name}_feat.npy, {dataset name}_label.npy, and {dataset name}_adj.npy; in that case, remember the value of k you used, because the dataset is then treated as a graph dataset.

🚌 Step 2: Put the files above into a directory named after the dataset name in lowercase. Then put that directory into the dataset directory.

🚌 Step 3: Add the information about the dataset to dataset_info.py.

🚌 Step 4: Use your dataset!
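
Before running a new dataset, it can help to check that the files from Step 1 are in place. The sketch below uses only the standard library; the function missing_dataset_files is hypothetical, and it assumes that the file prefix is the lowercase dataset name, matching the directory name from Step 2.

```python
from pathlib import Path


def missing_dataset_files(dataset_root, dataset_name, is_graph=True):
    """List the expected .npy files that are absent from dataset_root/<name>/.

    Assumption: files are named <name>_feat.npy, <name>_label.npy and, for
    graph data, <name>_adj.npy, where <name> is the lowercase dataset name.
    """
    name = dataset_name.lower()
    suffixes = ["feat", "label"] + (["adj"] if is_graph else [])
    folder = Path(dataset_root) / name
    return [f"{name}_{suffix}.npy" for suffix in suffixes
            if not (folder / f"{name}_{suffix}.npy").is_file()]


# An empty list means every expected file was found.
print(missing_dataset_files("dataset", "acm"))
```

For non-graph data that relies on KNN graph construction, pass is_graph=False so that the adjacency file is not required.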

🍎 Ending

Deep graph clustering is developing rapidly, and more graph clustering methods will be proposed in the future. A unified code framework can therefore save researchers coding and experiment time and let them put more energy into theoretical innovation. I believe that graph clustering will reach a higher level in the future.

If this repository is helpful to you, please remember to Star~😘.

Citation

If you use our code, please cite these papers:

@article{ding2023graph,
title = {Graph clustering network with structure embedding enhanced},
journal = {Pattern Recognition},
volume = {144},
pages = {109833},
year = {2023},
issn = {0031-3203},
doi = {https://doi.org/10.1016/j.patcog.2023.109833},
url = {https://www.sciencedirect.com/science/article/pii/S0031320323005319},
author = {Shifei Ding and Benyu Wu and Xiao Xu and Lili Guo and Ling Ding},
}

@article{ding2024towards,
author = {Ding, Shifei and Wu, Benyu and Ding, Ling and Xu, Xiao and Guo, Lili and Liao, Hongmei and Wu, Xindong},
title = {Towards Faster Deep Graph Clustering via Efficient Graph Auto-Encoder},
year = {2024},
issue_date = {September 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {18},
number = {8},
issn = {1556-4681},
url = {https://doi.org/10.1145/3674983},
doi = {10.1145/3674983},
journal = {ACM Trans. Knowl. Discov. Data},
month = {aug},
articleno = {202},
numpages = {23},
}