AskNowQA / LC-QuAD

A data set of natural language queries with corresponding SPARQL queries
GNU General Public License v3.0

LC-QuAD

Large-Scale Complex Question Answering Dataset

:loudspeaker: Announcement: LC-QuAD 2.0 is now released; check out our website at http://lc-quad.sda.tech .

Download

:hatching_chick: Train, Test Data

Links

:earth_africa: Webpage | :page_facing_up: Paper | :office: Lab

Introduction

We release and maintain a gold-standard KBQA (Question Answering over Knowledge Base) dataset containing 5,000 questions and their corresponding SPARQL queries. LC-QuAD uses DBpedia 2016-04 as the target KB.

Usage

License: The dataset is released under the GPL 3.0 License. You can download it above, or read on to learn more.

Versioning: We use the DBpedia 2016-04 release as our target KB. The public DBpedia endpoint (http://dbpedia.org/sparql) no longer serves this version, so many SPARQL queries may not retrieve any answers there. We strongly recommend hosting this version locally. To do so, see this guide
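Once a local mirror is running, queries are sent to its SPARQL endpoint instead of the public one. The sketch below builds such a request URL with the standard library; the endpoint address assumes a default Virtuoso setup (http://localhost:8890/sparql), so adjust it to your installation.

```python
from urllib.parse import urlencode

# Assumed local endpoint: Virtuoso's default SPARQL path.
# Change this to match your own DBpedia 2016-04 mirror.
LOCAL_ENDPOINT = "http://localhost:8890/sparql"

def build_sparql_url(endpoint, query):
    """Build a GET request URL asking the endpoint for JSON results."""
    params = urlencode({
        "query": query,
        "format": "application/sparql-results+json",
    })
    return f"{endpoint}?{params}"

query = ("SELECT DISTINCT ?uri WHERE { "
         "<http://dbpedia.org/resource/Batman> "
         "<http://dbpedia.org/ontology/creator> ?uri }")
url = build_sparql_url(LOCAL_ENDPOINT, query)
# The URL can then be fetched, e.g. with urllib.request.urlopen(url),
# against the running local mirror.
print(url)
```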

Splits: We release the dataset split into training and test sets in an 80:20 ratio.

Format: The dataset is released as JSON dumps, where the key corrected_question contains the question and the key query contains the corresponding SPARQL query.

Each datapoint in the dataset has the following JSON structure:

{
    "_id": "Unique ID of this datapoint",
    "corrected_question": "Corrected, final question",
    "id": "Template ID",
    "query": "SPARQL query",
    "template": "Template used to create the SPARQL query",
    "intermediary_question": "Automatically generated, grammatically incorrect question"
}
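A minimal sketch of reading these fields with the standard library follows. The record shown is an illustrative inline sample mirroring the structure above, not an actual dataset entry; in practice you would load the downloaded train or test dump instead (the filename is your own).

```python
import json

# Inline sample mimicking the dataset structure; real usage would be
# data = json.load(open("train-data.json"))  # filename is an assumption
raw = '''[
  {"_id": "0",
   "corrected_question": "Who created Batman?",
   "id": 1,
   "query": "SELECT DISTINCT ?uri WHERE { <http://dbpedia.org/resource/Batman> <http://dbpedia.org/ontology/creator> ?uri }",
   "template": "<illustrative template>",
   "intermediary_question": "What is the creator of Batman?"}
]'''

data = json.loads(raw)
# Extract (question, SPARQL) pairs for training a KBQA model.
pairs = [(d["corrected_question"], d["query"]) for d in data]
print(pairs[0][0])  # Who created Batman?
```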

Cite

@inproceedings{trivedi2017lc,
  title={Lc-quad: A corpus for complex question answering over knowledge graphs},
  author={Trivedi, Priyansh and Maheshwari, Gaurav and Dubey, Mohnish and Lehmann, Jens},
  booktitle={International Semantic Web Conference},
  pages={210--218},
  year={2017},
  organization={Springer}
}

Benchmarking/Leaderboard

We're in the process of automating the benchmarking (and updating results on our webpage). In the meantime, please get in touch with us at priyansh.trivedi@uni-bonn.de and we'll run it manually. Apologies for the inconvenience.

Methodology

Overview

We start with a set of seed entities and a predicate whitelist. Using the whitelist, we generate 2-hop subgraphs around the seed entities. With a seed entity as the supposed answer, we juxtapose SPARQL templates onto the subgraph and generate SPARQL queries.
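The subgraph step can be sketched as a whitelist-filtered, two-hop traversal from a seed entity. The triples, entity names, and whitelist below are toy values for illustration, not the actual LC-QuAD seed lists.

```python
# Toy KB triples (subject, predicate, object); illustrative only.
triples = [
    ("dbr:Batman", "dbo:creator", "dbr:Bill_Finger"),
    ("dbr:Bill_Finger", "dbo:birthPlace", "dbr:Denver"),
    ("dbr:Batman", "dbo:wikiPageID", "4335"),  # predicate not whitelisted
]
whitelist = {"dbo:creator", "dbo:birthPlace"}

def two_hop_subgraph(seed, triples, whitelist):
    """Collect whitelisted triples reachable from the seed within two hops."""
    subgraph, frontier = [], {seed}
    for _ in range(2):  # two hops
        next_frontier = set()
        for s, p, o in triples:
            if p in whitelist and s in frontier:
                subgraph.append((s, p, o))
                next_frontier.add(o)
        frontier = next_frontier
    return subgraph

print(two_hop_subgraph("dbr:Batman", triples, whitelist))
# [('dbr:Batman', 'dbo:creator', 'dbr:Bill_Finger'),
#  ('dbr:Bill_Finger', 'dbo:birthPlace', 'dbr:Denver')]
```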

Corresponding to each SPARQL template, and based on certain conditions, we assign hand-made natural language question templates to the SPARQL queries. Refer to this diagram to understand the nomenclature used in the templates.
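The pairing of a SPARQL template with a question template can be sketched as simple slot substitution. The template shapes below are hypothetical, chosen only to illustrate the idea, and are not the exact LC-QuAD templates.

```python
from string import Template

# Hypothetical template pair; the placeholders ($e, $p, and their labels)
# are illustrative, not the actual LC-QuAD template vocabulary.
sparql_tmpl = Template("SELECT DISTINCT ?uri WHERE { <$e> <$p> ?uri }")
question_tmpl = Template("What is the $p_label of $e_label?")

sparql = sparql_tmpl.substitute(
    e="http://dbpedia.org/resource/Batman",
    p="http://dbpedia.org/ontology/creator",
)
question = question_tmpl.substitute(p_label="creator", e_label="Batman")
print(question)  # What is the creator of Batman?
```

A human annotator would then correct and review such template-generated questions, as described below.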

Finally, we follow a two-step (Correct, Review) process to produce a grammatically correct question from every template-generated one.

Changelog

0.1.3 - 19-06-2018

0.1.2 - 28-01-2018

0.1.1 - 27-10-2017

0.1.0 - 01-05-2017