apoorvumang / kgt5

ACL 2022: Sequence-to-Sequence Knowledge Graph Completion and Question Answering (KGT5)
Apache License 2.0


KGT5

This is the implementation for the ACL 2022 Main Conference paper Sequence to Sequence Knowledge Graph Completion and Question Answering (KGT5).

Click here for a demo

We train a sequence-to-sequence T5-small model from scratch, i.e. we do not initialize with pre-trained LM weights. The model is trained on head/tail prediction, where the input is "\<prefix>:\<head entity>\<sep>\<relation>" and the expected output is "\<tail entity>". We use a unique textual representation for each entity based on its Wikidata title, disambiguating with the description or Wikidata ID where necessary. For KGQA, the model pre-trained on KG link prediction is fine-tuned on question-answer pairs.

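As a rough illustration of this interface, the sketch below verbalizes a (head, relation) query and decodes candidate tails with Hugging Face transformers. The checkpoint name is the WikiKG90Mv2 model linked under Resources below; the prefix/separator strings and the example query are illustrative assumptions, not necessarily the exact verbalization that checkpoint was trained with.

```python
# Minimal sketch of the head/tail-prediction interface described above.
# The checkpoint is the WikiKG90Mv2 model linked later in this README; the
# prefix and separator strings are assumptions for illustration and may differ
# from what a given checkpoint expects.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "apoorvumang/kgt5-wikikg90mv2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# "<prefix>:<head entity><sep><relation>" -> "<tail entity>"
query = "predict tail: Albert Einstein | field of work"  # assumed verbalization
inputs = tokenizer(query, return_tensors="pt")
out = model.generate(**inputs, max_length=32, num_beams=5, num_return_sequences=5)
for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```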

NEWS

New Codebase

We extended KGT5 to KGT5-context. This approach improves link prediction performance considerably. Further, it comes with a new codebase for easier reproduction.

KGT5-context codebase

Semi-Inductive Link Prediction

KGT5 as well as KGT5-context can also be used for semi-inductive link prediction as showcased on the new Wikidata5M-SI benchmark.

A Benchmark for Semi-Inductive Link Prediction in Knowledge Graphs

Checkpoints

You can find checkpoints for the dataset Wikidata5M in our new KGT5-context codebase.

Resources

The main branch currently only supports KGC on Wikidata5M, and only unfiltered hits@1 evaluation. The branch 'apoorv-dump' contains the latest code but is still being cleaned up. Data is yet to be uploaded. If you need any particular data or pretrained models that we used to obtain our results, please raise a GitHub issue and we will provide them.

For details/evaluation on WikiKG90Mv2, please see https://huggingface.co/apoorvumang/kgt5-wikikg90mv2.

To (kind of) reproduce the results for Wikidata5M, you can use the following code.

You need PyTorch plus the Hugging Face transformers and accelerate packages:

pip install transformers
pip install accelerate

KGC Dataset download: https://storage.googleapis.com/t5-kgc-colab/data/data.zip

KGQA Dataset download: https://storage.googleapis.com/t5-kgc-colab/data/data_kgqa.zip

Note: Please see issue #13 for details about the KGQA dataset. More details will be added here in the README soon.
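If you prefer to fetch the archives from Python rather than a browser, here is a minimal sketch using only the standard library (the URLs are the ones listed above; the target directory is an arbitrary choice):

```python
# Convenience sketch for fetching and unpacking the datasets listed above.
import urllib.request
import zipfile

DATASETS = {
    "data.zip": "https://storage.googleapis.com/t5-kgc-colab/data/data.zip",            # KGC (Wikidata5M)
    "data_kgqa.zip": "https://storage.googleapis.com/t5-kgc-colab/data/data_kgqa.zip",  # KGQA
}

for filename, url in DATASETS.items():
    urllib.request.urlretrieve(url, filename)  # download the archive
    with zipfile.ZipFile(filename) as zf:
        zf.extractall(".")                     # unpack into the current directory
    print(f"downloaded and extracted {filename}")
```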

Usage

Training

Multi GPU

Set --nproc_per_node to the number of GPUs you are using (6 in the example below).

CUDA_VISIBLE_DEVICES=1,2,3,4,5,7 python3 -m torch.distributed.launch --nproc_per_node 6 --use_env ./main_accelerate.py \
--save_prefix wd5m-6gpu \
--model_size small --dataset wikidata5m \
--batch_size 64 --save_steps 5000 \
--loss_steps 500

Single GPU

CUDA_VISIBLE_DEVICES=0 python3 main_accelerate.py \
--save_prefix wd5m-1gpu \
--model_size small --dataset wikidata5m \
--batch_size 64 --save_steps 5000 \
--loss_steps 500

Evaluation

This evaluates unfiltered hits@1.

CUDA_VISIBLE_DEVICES=0 python3 eval_accelerate.py --prefix wd5m-6gpu --checkpoint 90000 \
--dataset wikidata5m --batch_size 200
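
For reference, unfiltered hits@1 here means that, for each test triple, the model's single top decoded entity is compared against the gold tail, without filtering out other known true answers. A stand-alone sketch of that computation (the prediction/gold lists are placeholders, not the actual output format of eval_accelerate.py):

```python
# Sketch of unfiltered hits@1: a prediction counts as a hit only if the
# top-ranked decoded entity exactly matches the gold answer string.
# The example lists below are placeholders, not output of eval_accelerate.py.
def hits_at_1(predictions, gold):
    """predictions[i] is the model's top decoded entity for test example i."""
    hits = sum(1 for pred, ans in zip(predictions, gold) if pred == ans)
    return hits / len(gold)

if __name__ == "__main__":
    preds = ["Ulm", "physicist", "Germany"]
    gold = ["Ulm", "theoretical physicist", "Germany"]
    print(f"hits@1 (unfiltered): {hits_at_1(preds, gold):.3f}")  # 0.667
```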

How to cite

If you used our work or found it helpful, please use the following citation:

@inproceedings{saxena2022kgt5,
  title={Sequence-to-Sequence Knowledge Graph Completion and Question Answering},
  author={Saxena, Apoorv and Kochsiek, Adrian and Gemulla, Rainer},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
  year={2022}
}