This is the implementation for the ACL 2022 Main Conference paper "Sequence-to-Sequence Knowledge Graph Completion and Question Answering" (KGT5).
We train a sequence-to-sequence T5-small model from scratch; we do not initialize with pre-trained LM weights. The model is trained on head/tail prediction: the input is "\<prefix>: \<head entity> \<sep> \<relation>" and the expected output is "\<tail entity>". Each entity gets a unique textual representation based on its Wikidata title, disambiguated with its description or Wikidata ID where necessary. For KGQA, the model pre-trained on KG link prediction is fine-tuned on question-answer pairs. A rough sketch of how such a query might be verbalized is shown below.
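For illustration only, a tail-prediction query could look like the following; the exact prefix and separator strings are assumptions here and are defined by the dataset preprocessing in this repo:

```python
# Hedged sketch: verbalizing a link-prediction query as text for the seq2seq model.
# The prefix ("predict tail") and separator ("|") are illustrative assumptions;
# the actual strings are defined by the dataset files used in this repo.
def verbalize_tail_query(head_text: str, relation_text: str) -> str:
    return f"predict tail: {head_text} | {relation_text}"

model_input = verbalize_tail_query("Albert Einstein", "country of citizenship")
expected_output = "Switzerland"  # the tail entity's textual representation
```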
We have extended KGT5 to KGT5-context, which considerably improves link prediction performance and comes with a new codebase for easier reproduction.
Both KGT5 and KGT5-context can also be used for semi-inductive link prediction, as showcased on the new Wikidata5M-SI benchmark introduced in "A Benchmark for Semi-Inductive Link Prediction in Knowledge Graphs".
You can find checkpoints for Wikidata5M in our new KGT5-context codebase.
The main branch currently only supports KGC on Wikidata5M with unfiltered hits@1 evaluation. The branch 'apoorv-dump' contains the latest code, but it is still being cleaned up and its data is yet to be uploaded. If you need any particular data or pretrained models that we used to obtain results, please raise a GitHub issue and we will provide it.
For details/evaluation on WikiKG90Mv2, please see https://huggingface.co/apoorvumang/kgt5-wikikg90mv2.
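As a rough sketch (not the official evaluation script), such a checkpoint can be queried with the standard transformers API. The prompt format below (prefix and separator) is an assumption for illustration; please check the model card for the verbalization the checkpoint was actually trained with:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the released checkpoint from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("apoorvumang/kgt5-wikikg90mv2")
model = AutoModelForSeq2SeqLM.from_pretrained("apoorvumang/kgt5-wikikg90mv2")

# NOTE: the prefix/separator here are illustrative assumptions; use the
# verbalization the checkpoint was trained with (see the model card).
query = "predict tail: Albert Einstein | country of citizenship"
input_ids = tokenizer(query, return_tensors="pt").input_ids

# Generate several candidate tails and decode them to text.
outputs = model.generate(input_ids, num_beams=5, num_return_sequences=5, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```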
To (approximately) reproduce the results for Wikidata5M, you can use the following code.
You need PyTorch plus the Hugging Face transformers and accelerate packages.
pip install transformers
pip install accelerate
KGC Dataset download: https://storage.googleapis.com/t5-kgc-colab/data/data.zip
KGQA Dataset download: https://storage.googleapis.com/t5-kgc-colab/data/data_kgqa.zip
Note: Please see issue #13 for details about the KGQA dataset. More details will be added here in the README soon.
Set the parameter --nproc_per_node to the number of GPUs that you use. For example, to train on 6 GPUs:
CUDA_VISIBLE_DEVICES=1,2,3,4,5,7 python3 -m torch.distributed.launch --nproc_per_node 6 --use_env ./main_accelerate.py \
--save_prefix wd5m-6gpu \
--model_size small --dataset wikidata5m \
--batch_size 64 --save_steps 5000 \
--loss_steps 500
To train on a single GPU:
CUDA_VISIBLE_DEVICES=0 python3 main_accelerate.py \
--save_prefix wd5m-1gpu \
--model_size small --dataset wikidata5m \
--batch_size 64 --save_steps 5000 \
--loss_steps 500
The following command evaluates unfiltered hits@1:
CUDA_VISIBLE_DEVICES=0 python3 eval_accelerate.py --prefix wd5m-6gpu --checkpoint 90000 \
--dataset wikidata5m --batch_size 200
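For reference, unfiltered hits@1 simply checks whether the single top-ranked decoded string matches the gold answer, without filtering out other known true entities. A minimal sketch (not the actual logic in eval_accelerate.py):

```python
# Minimal sketch of unfiltered hits@1 over decoded predictions.
# predictions: top-1 decoded string per test triple; golds: gold tail entity text.
def hits_at_1_unfiltered(predictions: list[str], golds: list[str]) -> float:
    correct = sum(pred == gold for pred, gold in zip(predictions, golds))
    return correct / len(golds)
```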
If you use our work or find it helpful, please cite:
@inproceedings{saxena2022kgt5,
title={Sequence-to-Sequence Knowledge Graph Completion and Question Answering},
author={Saxena, Apoorv and Kochsiek, Adrian and Gemulla, Rainer},
booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
year={2022}
}