
EntMatcher: An Open-source Library


Entity alignment (EA) identifies equivalent entities located in different knowledge graphs (KGs), and has attracted growing research interest over the last few years with the advancement of KG embedding techniques. Although many embedding-based EA frameworks have been developed, they mainly focus on improving the performance of entity representation learning, while largely overlooking the subsequent stage that matches KGs in entity embedding spaces. Nevertheless, accurately matching entities based on learned entity representations is crucial to the overall alignment performance, as it coordinates individual alignment decisions and determines the global matching result. Hence, it is essential to understand how well existing solutions for matching KGs in entity embedding spaces perform on present benchmarks, as well as their strengths and weaknesses. To this end, this article provides a comprehensive survey and evaluation of matching algorithms for KGs in entity embedding spaces, in terms of effectiveness and efficiency, on both classic settings and new scenarios that better mirror real-life challenges. Based on in-depth analysis, we provide useful insights into the design trade-offs and good paradigms of existing works, and suggest promising directions for future development.

Contents

Overview

We use Python, PyTorch, and TensorFlow to develop EntMatcher, an open-source library.

The architecture of the EntMatcher library is presented in the blue block of the figure above; it takes unified entity embeddings as input and produces the matched entity pairs. It has the following three major features:

Currently, the EntMatcher library (together with additional extra modules) has integrated the following modules, and the approaches in these modules can be combined arbitrarily:

Getting started

Code organization

data/: datasets
models/: generating the input unified entity embeddings using existing representation learning methods
src/
|-- entmatcher/
|   |--algorithms/: package of the standalone algorithms
|   |--extras/: package of the extra modules
|   |--modules/: package of the main modules
|   |--embed_matching.py: implementation of calling the standalone algorithms
|   |--example.py: implementation of calling the modules

Dependencies

Installation

We recommend creating a new conda environment to install and run EntMatcher.

conda create -n entmatcher python=3.8.10
conda activate entmatcher
conda install pytorch==1.x torchvision==0.x torchaudio==0.x cudatoolkit=xxx -c pytorch
conda install scipy
conda install tensorflow-gpu==2.6.0
conda install Keras==2.6.0
conda install -c conda-forge fml

Then, EntMatcher can be installed using pip with the following steps:

git clone https://github.com/DexterZeng/EntMatcher.git EntMatcher
cd EntMatcher
pip install EntMatcher-0.1.tar.gz

Usage

1. Generate input unified entity embeddings

cd models
python gcn.py --data_dir "zh_en"
python rrea.py --data_dir "zh_en"

The --data_dir argument can be set to any of the dataset directories. Alternatively, you can directly run:

bash stru.sh

If you want to reproduce our results, you can download the trained [structural embeddings] we provide. Among them, vec.npy is a structural embedding trained by GCN, and vec-new.npy is a structural embedding trained by RREA.

As for the auxiliary information, we obtain the entity name embeddings from EAE, which can also be found here.

2. Matching KGs in entity embedding spaces

To call different algorithms, you can run

cd src
python embed_matching.py

where --algorithm can be set to one of dinf, csls, rinf, sinkhorn, hun, sm, or rl.
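As a point of reference, the csls option presumably follows the Cross-domain Similarity Local Scaling idea from the word-translation literature, which rescores a similarity matrix to penalize "hub" entities. A minimal NumPy sketch of that rescoring step (the function name csls_rescore and neighborhood size k are illustrative, not the library's API):

```python
import numpy as np

def csls_rescore(S, k=10):
    """CSLS rescoring of a similarity matrix S (sources x targets)."""
    # Average similarity of each source entity to its k most similar targets.
    r_src = np.sort(S, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    # Average similarity of each target entity to its k most similar sources.
    r_tgt = np.sort(S, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    # Subtracting both averages penalizes entities in dense regions.
    return 2 * S - r_src - r_tgt
```

Matching can then proceed greedily (e.g., row-wise argmax) on the rescored matrix.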

Other configurations: --mode can be chosen from 1-to-1, mul, unm; --encoder can be chosen from gcn, rrea; --features can be chosen from stru, name, struname; --data_dir can be chosen from the dataset directories.
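For intuition on the hun option, which presumably solves a global 1-to-1 assignment with the Hungarian algorithm, the same computation can be sketched with SciPy's linear_sum_assignment on a toy similarity matrix (this is an illustration, not the library's internal code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix: 3 source entities x 3 target entities.
S = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.3],
              [0.1, 0.4, 0.7]])

# Hungarian algorithm: find the 1-to-1 assignment maximizing total similarity.
rows, cols = linear_sum_assignment(S, maximize=True)
pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)  # [(0, 0), (1, 1), (2, 2)]
```

Unlike greedy nearest-neighbor matching, this enforces that every source entity is matched to exactly one target entity and vice versa.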

Or you can explore the different modules and design new strategies by following example.py. Main configurations:

3. Example

The following is an example of how to use EntMatcher in Python (we assume that you have already downloaded our datasets and know how to use them).

First, you need to generate embedding vectors with an EA model and save them to a .npy file named after the model.

python rrea.py --data_dir "zh_en"

Then, you can use these vectors to select the appropriate algorithm for matching calculations.

import numpy as np
import entmatcher as em

args = load_args("hyperparameter file folder")   # load hyper-parameters
kgs = read_kgs_from_folder("data folder")        # load the two KGs
dataset = em.extras.Datasets(args)
algorithm = em.algorithms.csls                   # pick a matching algorithm
# structural embeddings produced by the chosen encoder (e.g., rrea)
se_vec = np.load(args.data_dir + '/' + args.encoder + '.npy')
name_vec = dataset.loadNe()                      # entity name embeddings
algorithm.match([se_vec, name_vec], dataset)

For more convenient use, you can run the code we prepared and just adjust the parameters:

python embed_matching.py --data_dir ../data/zh_en --encoder rrea --algorithm csls --features stru

Datasets

Existing Datasets statistics

We used popular EA benchmarks for evaluation:

The detailed statistics can be found in the table, where the numbers of entities, relations, triples, and gold links, as well as the average entity degree, are reported.

The original datasets are obtained from DBP15K dataset, GCN-Align and JAPE:

Dataset Description

Regarding the gold alignment links, we adopted 70% as the test set, 20% for training, and 10% for validation. The folder names of the datasets used by the code are as follows:

Take the dataset DBP15K (ZH-EN) as an example, the folder data/zh_en contains:

Non 1-to-1 Alignment Dataset

We also offer our constructed non-1-to-1 alignment dataset FB_DBP_MUL (shortened as mul), which adopts the same format.

Dataset Usage

Unzip data.zip. To use the auxiliary information, obtain the name embedding and structural embedding files and place them under the corresponding dataset directories.

For example, for the name embeddings, move the name/zh_en/name_trans_vec_ftext.txt file from the name/zh_en folder into the data/zh_en folder.

Experiments and Results

To reproduce the experimental results in the paper, you can first download the unified structural embeddings and name embeddings. Then put the files under the corresponding directories.

Experiment Settings

Hardware configuration and hyper-parameter setting

We followed the configurations presented in the original papers of these algorithms, and tuned the hyper-parameters on the validation set.

Representation learning models

Since representation learning is not the focus of this work, we adopted two frequently used models, i.e., RREA and GCN.

Auxiliary information for alignment

Although EA underlines the use of graph structure for alignment (An experimental study of state-of-the-art entity alignment approaches, IEEE TKDE 2020), for a more comprehensive evaluation we examined the influence of auxiliary information on the matching results, following previous works and using entity name embeddings to facilitate alignment. We also combined these two channels of information with equal weights to generate the fused similarity matrix for matching.
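The equal-weight fusion of the two channels can be sketched as follows (the function name fuse is illustrative; it assumes the structural and name similarity matrices are already computed and have the same shape):

```python
import numpy as np

def fuse(S_stru, S_name, w=0.5):
    """Combine structural and name similarity matrices with weight w."""
    # w = 0.5 gives the equal-weight fusion used in the evaluation.
    return w * S_stru + (1 - w) * S_name
```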

Similarity metric

After obtaining the unified entity representations E, a similarity metric is required to produce pairwise scores and generate the similarity matrix S. Frequent choices include cosine similarity, Euclidean distance, and Manhattan distance. In this work, we followed mainstream works and adopted cosine similarity.
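For reference, the similarity matrix S under cosine similarity can be computed as below (a minimal NumPy sketch, not the library's internal code; E1 and E2 are the n x d and m x d embedding matrices of the two KGs):

```python
import numpy as np

def cosine_sim(E1, E2):
    """Pairwise cosine similarity: returns the n x m matrix S."""
    # L2-normalize each embedding row, then a dot product gives cosine.
    E1 = E1 / np.linalg.norm(E1, axis=1, keepdims=True)
    E2 = E2 / np.linalg.norm(E2, axis=1, keepdims=True)
    return E1 @ E2.T
```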

Next, you can run

cd src
python embed_matching.py --algorithm dinf --mode 1-to-1 --encoder gcn --features stru --data_dir "../data/zh_en"

Or you can directly run:

bash structural.sh
bash auxiliary.sh

and vary the parameter settings.

Results

The results of the experiment using only structural information and using auxiliary information are as follows:

The F1 scores of only using structural information

The F1 scores of using auxiliary information

Due to the instability of embedding-based methods, it is normal for the results to fluctuate slightly across repeated runs.

More features and experimental results will be published in subsequent papers.

If you have any questions about reproduction, please feel free to email zengweixin13@nudt.edu.cn.