cenguix / Text2KGBench

Repo ISWC-2023 Tekgen Corpus Submission
Apache License 2.0

Text2KG: A Benchmark for Ontology Driven Knowledge Graph Generation from Text

Code License: Apache 2.0 | Data License: CC BY 4.0 | Python 3.9+

This is the repository for the ISWC 2023 Resource Track submission Text2KGBench: Benchmark for Ontology-Driven Knowledge Graph Generation from Text. Text2KGBench is a benchmark for evaluating how well language models can generate knowledge graphs (KGs) from natural language text when guided by an ontology. Given an input ontology and a set of sentences, the task is to extract facts from the text while complying with the given ontology (concepts, relations, domain/range constraints) and remaining faithful to the input sentences.

It contains two datasets: (i) Wikidata-TekGen, with 10 ontologies and 13,474 sentences, and (ii) DBpedia-WebNLG, with 19 ontologies and 4,860 sentences.

An example

An example test sentence:

{"id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King."}
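Assuming the test sentences are stored one JSON object per line (JSON Lines), which the shape of the example record suggests but this README does not state, a minimal loading sketch in Python could look like the following. The function name `parse_sentences` and the in-memory sample are illustrative only and are not part of the repository.

```python
import json

def parse_sentences(lines):
    """Parse JSON Lines records into a dict mapping sentence id -> sentence text."""
    sentences = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        sentences[record["id"]] = record["sent"]
    return sentences

# One record shaped like the example test sentence (raw string keeps the \" escapes).
sample = [
    r'{"id": "ont_k_music_test_n", "sent": "\"The Loco-Motion\" is a 1962 pop song '
    r'written by American songwriters Gerry Goffin and Carole King."}'
]

sentences = parse_sentences(sample)
print(sentences["ont_k_music_test_n"])
```

To read from a file instead, pass an open file handle as `lines`; iterating over a text file yields one line at a time.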

An example ontology:

Ontology: Music Ontology

[Figure: Music ontology diagram]

Expected Output:

{
  "id": "ont_k_music_test_n",
  "sent": "\"The Loco-Motion\" is a 1962 pop song written by American songwriters Gerry Goffin and Carole King.",
  "triples": [
    {
      "sub": "The Loco-Motion",
      "rel": "publication date",
      "obj": "01 January 1962"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Gerry Goffin"
    },
    {
      "sub": "The Loco-Motion",
      "rel": "lyrics by",
      "obj": "Carole King"
    }
  ]
}
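The two requirements of the task, complying with the ontology and staying faithful to the sentence, can be illustrated with a small Python sketch. This is not the benchmark's evaluation code: `ALLOWED_RELATIONS` is a hypothetical two-relation subset of the Music ontology, the `predicted` triples are made-up system output, and exact substring matching is a deliberately naive stand-in for a real faithfulness check.

```python
# Hypothetical subset of the Music ontology (assumption: only the two
# relations that appear in the expected output above are listed).
ALLOWED_RELATIONS = {"publication date", "lyrics by"}

SENTENCE = ('"The Loco-Motion" is a 1962 pop song written by '
            'American songwriters Gerry Goffin and Carole King.')

# Made-up system output: the last triple copies its relation name from
# the sentence ("written by") instead of using an ontology relation.
predicted = [
    {"sub": "The Loco-Motion", "rel": "publication date", "obj": "01 January 1962"},
    {"sub": "The Loco-Motion", "rel": "lyrics by", "obj": "Gerry Goffin"},
    {"sub": "The Loco-Motion", "rel": "written by", "obj": "Carole King"},
]

def nonconformant(triples, allowed_relations):
    """Return triples whose relation is not in the allowed set."""
    return [t for t in triples if t["rel"] not in allowed_relations]

def ungrounded_terms(triples, sentence):
    """Return subject/object strings that do not occur verbatim in the sentence."""
    terms = {t["sub"] for t in triples} | {t["obj"] for t in triples}
    return sorted(term for term in terms if term not in sentence)

bad_relations = nonconformant(predicted, ALLOWED_RELATIONS)
missing = ungrounded_terms(predicted, SENTENCE)

print([t["rel"] for t in bad_relations])  # ['written by']
print(missing)                            # ['01 January 1962']
```

Note that the normalized date "01 January 1962" is reported as ungrounded even though it appears in the expected output, because the sentence only says "1962". Any practical faithfulness check would need looser matching for normalized values than the verbatim comparison used here.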

The data is released under a Creative Commons Attribution 4.0 International (CC BY 4.0) License.

The structure of the repo is as follows.

This benchmark contains data derived from the TekGen corpus (part of the KELM corpus) [1], released under the CC BY-SA 2.0 license, and the WebNLG 3.0 corpus [2], released under the CC BY-NC-SA 4.0 license.

[1] Oshin Agarwal, Heming Ge, Siamak Shakeri, and Rami Al-Rfou. 2021. Knowledge Graph Based Synthetic Corpus Generation for Knowledge-Enhanced Language Model Pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3554–3565, Online. Association for Computational Linguistics.

[2] Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.