
ActiveLink

A deep active learning framework for link prediction in knowledge graphs.

Dataset

data/<dataset_name>

Each dataset contains three files: train.txt, test.txt, and valid.txt. Each file contains a list of triples, one triple per line. Triple format: <entity_1>\t<relation>\t<entity_2>
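
For illustration, a minimal Python sketch that reads such a file into memory (the function name and example path are hypothetical, not part of the codebase):

def read_triples(path):
    """Read a tab-separated triple file into a list of (entity_1, relation, entity_2) tuples."""
    triples = []
    with open(path) as f:
        for line in f:
            head, relation, tail = line.strip().split("\t")
            triples.append((head, relation, tail))
    return triples

# e.g. triples = read_triples("data/FB15k-237/train.txt")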

Before the first use, the dataset must be preprocessed:

python preprocess.py <dataset_name>
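
For example, for the FB15k-237 dataset used in the paper:

python preprocess.py FB15k-237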

This step generates six files, including the entity2id.txt and relation2id.txt mappings referenced below.

Embeddings

For clustering entities (Structured Uncertainty sampling; see Section 3.2 of the paper for more details), entity embeddings must be trained beforehand. We used the TransE model with the following parameters:

Details on TransE: Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. "Learning Entity and Relation Embeddings for Knowledge Graph Completion." In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI'15), 2015.

NB: TransE requires a mapping from entity/relation labels to ids. Use the entity2id.txt and relation2id.txt files generated at the preprocessing step.
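
As a rough illustration of what TransE computes (a sketch of the scoring function only, not the training code we used; variable names are ours):

import numpy as np

def transe_score(h, r, t):
    # TransE models a relation as a translation in embedding space:
    # the smaller ||h + r - t||, the more plausible the triple (h, r, t).
    return np.linalg.norm(h + r - t)

# entity2id.txt / relation2id.txt map labels to row indices of the
# embedding matrices, e.g. entity_emb[entity2id[entity_label]].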

Running a model

Parameters

You can configure your run via command line arguments:

--al-epochs
    number of iterations of active learning: (dataset_size * fraction_used) / sample_size (see the worked example after this list)
--batch-size
    training batch size
--dataset
    name of dataset
--embedding-dim
    number of embedding dimensions for entities and relations 
--early-stop-threshold
    stop training when the trigger value exceeds this threshold (see Early Stopping below)
--eval-rate
    evaluate model performance every N epochs (see Evaluation Rate below)
--inner-lr
    learning rate for inner update in meta-incremental training
--lr
    learning rate (meta-incremental training: learning rate for meta update)
--lr-decay
    learning rate decay
--model
    link prediction model; two options are possible: ConvE or MLP
--n-clusters
    number of clusters for Structured Uncertainty sampling
--sample-size
    number of training examples per AL iteration
--sampling-mode
    random, uncertainty, structured or structured-uncertainty
--training-mode
    retrain, incremental or meta-incremental
--window-size
    size of the window for meta-incremental training
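
As a worked example of the --al-epochs formula above (the numbers are made up): with 100000 training triples, a used fraction of 0.2, and a sample size of 1000, there are (100000 * 0.2) / 1000 = 20 active learning iterations:

dataset_size, fraction_used, sample_size = 100000, 0.2, 1000  # hypothetical values
al_epochs = int(dataset_size * fraction_used / sample_size)   # 20 AL iterations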

To reproduce the paper's results for FB15k-237:

python main.py --dataset FB15k-237 --model ConvE

(All the other parameters already have the correct default values.)

Early Stopping

We use early stopping at each iteration of active learning. As a trigger we use the following formula:

(100 * (MR / MR_opt - 1)),

where MR is the mean rank after the current training epoch, and MR_opt is the best mean rank achieved over the previous training epochs within the same active learning iteration.
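
A minimal sketch of this check (the helper name is ours, not from the codebase):

def should_stop(mr, mr_opt, threshold):
    # Stop once the current mean rank is more than `threshold` percent
    # worse than the best mean rank seen in this AL iteration.
    trigger = 100 * (mr / mr_opt - 1)
    return trigger > threshold

For example, with MR = 105 and MR_opt = 100 the trigger value is 5, so training stops as soon as --early-stop-threshold is below 5.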

Evaluation Rate

Since active learning uses only a small fraction of the dataset at each iteration, the overall number of training epochs is much larger in the active learning setup than in a traditional supervised approach (in fact, one iteration of active learning is comparable to a full training cycle of non-active learning in terms of the number of training epochs). For time efficiency we do not evaluate model performance after each epoch, but rather after every N epochs, where N is set via --eval-rate.
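
In other words, evaluation only happens when the epoch number is a multiple of the --eval-rate value; a sketch, assuming hypothetical train_one_epoch and evaluate helpers:

n_epochs, eval_rate = 100, 5  # hypothetical values
for epoch in range(1, n_epochs + 1):
    train_one_epoch(model)          # hypothetical helper: one pass over the current sample
    if epoch % eval_rate == 0:      # evaluate only every `eval_rate` epochs
        evaluate(model)             # hypothetical helper: compute mean rank etc.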

Important Note

The library requires PyTorch version 0.3.1. For newer versions, some migration updates might be needed.

References

For the full method description and experimental results please refer to our paper:

Natalia Ostapuk, Jie Yang, and Philippe Cudré-Mauroux. "ActiveLink: Deep Active Learning for Link Prediction in Knowledge Graphs." In Proceedings of The Web Conference (WWW 2019), 2019. [PDF]

Acknowledgement

The model architecture and some valuable pieces of code are borrowed from this project: https://github.com/TimDettmers/ConvE