Split-GNN / SplitGNN

18 stars 9 forks source link

SplitGNN

This is the PyTorch implementation for the CIKM 2023 paper:

SplitGNN: Spectral Graph Neural Network for Fraud Detection against Heterophily

Bin Wu, Xinyu Yao, Boyan Zhang, Kuo-Ming Chao, Yinsheng Li

model

Dependencies

Usage

Model Training

We take YelpChi as an example to illustrate the usage of repository.

# Unzip the dataset
upzip ./data/YelpChi.zip ./data/

# Move to src/
cd src/

# Convert the original dataset to dgl graph
# The generated dgl graph contains the features, graph structure and edge labels.
python data_preprocess.py --dataset yelp

# Train and test the dataset
# If you want to change the parameters in training process, you can modify the corresponding yaml file in config.
python train.py --dataset yelp 

Data Description of FDCompCN

Dataset #Nodes #Fraud(%) #Features Relation #Edges
FDCompCN 5,317 559 (10.5%) 57 C-I-C
C-S-C
C-P-C
Homo
5,686
760
1,043
7,407

A new fraud detection dataset FDCompCN for detecting financial statement fraud of companies in China. We construct a multi-relation graph based on the supplier, customer, shareholder, and financial information disclosed in the financial statements of Chinese companies. These data are obtained from the China Stock Market and Accounting Research (CSMAR) database. We select samples between 2020 and 2023, including 5,317 publicly listed Chinese companies traded on the Shanghai, Shenzhen, and Beijing Stock Exchanges.

FDCompCN has three relations:

1) C-I-C that connects companies that have investment relationships. 2) C-P-C that connects companies and their disclosed customers. 3) C-S-C that connects companies and their disclosed suppliers.

Each company contains basic and financial statement information. The basic information includes registered capital, currency, operating status, company type, industry, city, personnel size, and the number of insured individuals. The financial statement information includes such as long-term accounts receivable, long-term liabilities, and total assets. The original financial statement information contains 149 indicators. We retain 49 financial indicators. Finally, we process the basic information and financial statement information into 57-dimensional features. Financial statement fraud includes seven types of violations disclosed by Chinese regulators, including inflated profits, inflated assets, false statements, delay in disclosure, omission of significant information, fraudulent disclosures, and general accounting irregularities. Companies with more than three violations are labeled as fraudulent samples, while other companies are labeled as benign. 559 fraud samples and 4758 benign samples are ultimately obtained, with fraud samples accounting for 10.51%.

Reproduce Results

The hyperparameter of SplitGNN to reproduce the results.

Paramater YelpChi Amazon FDCompCN
learning rate 0.01 0.1 0.001
weight decay 0.00005 0.00005 0.00005
gamma 0.6 0.4 0.8
n_hidden 8 8 8
order C 2 2 3
tunable K 1 1 2
dropout 0.1 0.1 0.1
max epoch 1000 1000 1000
early stop 100 50 100

Run on your Datasets

To run SplitGNN on your datasets, you need to prepare the following data:

Transform the data into DGL format using the code in data_preprocess.py as a reference.

Citation

Please cite the paper if you use our code or data.

@inproceedings{10.1145/3583780.3615067,
  author = {Wu, Bin and Yao, Xinyu and Zhang, Boyan and Chao, Kuo-Ming and Li, Yinsheng},
  title = {SplitGNN: Spectral Graph Neural Network for Fraud Detection against Heterophily},
  year = {2023},
  url = {https://doi.org/10.1145/3583780.3615067},
  booktitle = {Proceedings of the 32nd ACM International Conference on Information and Knowledge Management},
  series = {CIKM '23}
}