BrambleXu / knowledge-graph-learning

A curated list of awesome knowledge graph tutorials, projects and communities.
MIT License

FTIR(J)-2016-Semantic Search on Text and Knowledge Bases #312

Open BrambleXu opened 4 years ago

BrambleXu commented 4 years ago

Summary:

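This is a survey on semantic search. It describes in detail the different paradigms of semantic search, the NLP tasks involved, and the different approaches based on text and on knowledge bases (KBs).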

FTIR: Foundations and Trends in Information Retrieval

Resource:

Paper information:

Notes:

2 Classification by Data Type and Search Paradigm

2.1 Data Types and Common Datasets

2.1.2 Structured Data / Knowledge Bases

A KB is a collection of records in a database. Records are often stored as triples of the form subject predicate object.
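To make the triple form concrete, here is a minimal sketch in Python. The entity and predicate names are made up for illustration and do not come from the survey or from any real KB; `match` is a hypothetical helper in which `None` acts as a wildcard.

```python
from typing import Iterable, Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Toy records; names are illustrative only, not taken from a real KB.
TRIPLES = [
    ("Paris", "is-a", "City"),
    ("Paris", "located-in", "France"),
    ("France", "is-a", "Country"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that agrees with the given pattern (None = wildcard)."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


# All subjects that are declared to be a City.
cities = [subj for subj, _, _ in match(TRIPLES, p="is-a", o="City")]
print(cities)
```

A linear scan like this is only meant to show the data model; real systems answer such patterns through indexes.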

Collections of records / triples from different sources with different naming schemes count as Combined Data, which is discussed in Section 2.1.3.

Data format in Freebase:

image

Some commonly used KBs:

image

Data Formats:

KB data is usually stored in RDF format. After serialization, it can be stored in the following forms:
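The original list of formats is not preserved in these notes. As one illustration, N-Triples is a W3C-standard, line-based RDF serialization in which each line holds one triple and ends with a period. The sketch below assumes a hypothetical `http://example.org/` namespace, handles only the most common string escapes, and ignores datatypes and language tags.

```python
def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping the most common special characters."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize already-formatted (s, p, o) terms, one triple per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


EX = "http://example.org/"  # hypothetical namespace, for illustration only

doc = to_ntriples([
    (iri(EX + "Paris"), iri(EX + "capitalOf"), iri(EX + "France")),
    (iri(EX + "Paris"), iri(EX + "name"), literal("Paris")),
])
print(doc)
```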

2.1.3 Combined Data

Text can be combined with a KB, and multiple KBs can also be combined with each other.

The two principles of combined data:

Commonly Used Datasets:

image

Text Linked to a Knowledge Base:

Links are encoded into the text in XML format. Below is an example from Wikipedia LOD:

image

Semantic Web:

The data from the Semantic Web is often also called linked open data (LOD), because contents can be contributed and interlinked by anyone, just like web pages (but in a different format, see below). We call this kind of data semantic web data; it satisfies the two principles of combined data:

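The "mult" principle can be realized with RDF, which makes it possible to link arbitrary text to RDF data. See, for example, the following data about the French city Embrun. Note that the prefixes rdf: and gn: make the content more compact and easier to read: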

image
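The effect of the rdf: and gn: prefixes can be sketched as simple string expansion. The two namespace IRIs below are the standard RDF syntax namespace and the GeoNames ontology namespace; the helper names `expand` and `compact` are hypothetical and not part of any library.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "gn": "http://www.geonames.org/ontology#",
}


def expand(name: str, prefixes=PREFIXES) -> str:
    """Turn a prefixed name such as 'gn:name' into a full IRI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return prefixes[prefix] + local


def compact(full_iri: str, prefixes=PREFIXES) -> str:
    """Shorten a full IRI to a prefixed name when a known namespace matches."""
    for prefix, namespace in prefixes.items():
        if full_iri.startswith(namespace):
            return f"{prefix}:{full_iri[len(namespace):]}"
    return full_iri  # no known prefix: leave the IRI unchanged


print(expand("rdf:type"))
print(compact("http://www.geonames.org/ontology#name"))
```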

The "link" principle can be realized through semantic markup. For example, below is an HTML page that uses Microdata markup:

image
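As a rough illustration of how Microdata ties page text to properties, the sketch below collects the text of elements carrying an `itemprop` attribute, using Python's standard `html.parser`. The HTML snippet is a made-up example, not the one from the survey, and the collector is deliberately simplified: it assumes `itemprop` elements are not nested and ignores `itemscope` boundaries.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect the text content of elements that carry an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None  # itemprop name of the element currently being read

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name is not None:
            self._current = name
            self.props[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.props[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current property.
        self._current = None


# Made-up page fragment for illustration.
PAGE = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Ada Lovelace</span> was a
  <span itemprop="jobTitle">mathematician</span>.
</div>
"""

collector = ItempropCollector()
collector.feed(PAGE)
collector.close()
props = collector.props
print(props)
```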

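Below are the four most commonly used semantic markup formats. The first three use HTML tags directly. The advantage of JSON-LD over the other three is that it separates ordinary content from semantic content more clearly: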

image
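The separation that JSON-LD offers can be sketched as follows: the visible HTML stays free of annotations, and the semantic content sits in its own `script` element of type `application/ld+json`. The person record and the helper `jsonld_script` are illustrative assumptions, not taken from the survey.

```python
import json

# Illustrative semantic content, expressed with schema.org vocabulary.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Ada Lovelace",
    "jobTitle": "mathematician",
}


def jsonld_script(data: dict) -> str:
    """Wrap a JSON-LD object in the script element used to embed it in HTML."""
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


ordinary = "<p>Ada Lovelace was a mathematician.</p>"  # unannotated page text
html = ordinary + "\n" + jsonld_script(person)
print(html)
```

With Microdata, by contrast, the annotations would be attributes on the `<p>` content itself.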

2.2 Search Paradigms

2.2.1 Keyword Search

image

2.2.2 Structured Search

image

2.2.3 Natural Language Search

image

3 Basic NLP Tasks in Semantic Search

TODO

4 Approaches and Systems for Semantic Search

4.2 Structured Search in Knowledge Bases

image

4.2.1 Basic Techniques

There are two ways to store a KB: in a standard relational database management system (RDBMS), or in a triple store. The latter is covered in Section 4.2.2.

If the KB is stored in an RDBMS and the query is written in SPARQL, the query can be translated into SQL queries.
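A minimal sketch of this translation, assuming the simplest possible layout: a single `triples` table with `subject`, `predicate`, and `object` columns. Each triple pattern of a basic SPARQL query becomes one alias of that table, and a variable shared between patterns becomes a join condition. The function `patterns_to_sql`, the table layout, and the data are illustrative assumptions; real systems use more elaborate schemas and handle far more of SPARQL.

```python
import sqlite3


def patterns_to_sql(patterns, select):
    """Translate a conjunction of triple patterns into one SQL self-join.

    Terms starting with '?' are variables; everything else is a constant.
    """
    columns = ("subject", "predicate", "object")
    conditions, params, first_seen = [], [], {}
    for i, pattern in enumerate(patterns):
        for column, term in zip(columns, pattern):
            ref = f"t{i}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    # Repeated variable: both positions must hold the same value.
                    conditions.append(f"{ref} = {first_seen[term]}")
                else:
                    first_seen[term] = ref
            else:
                conditions.append(f"{ref} = ?")
                params.append(term)
    tables = ", ".join(f"triples AS t{i}" for i in range(len(patterns)))
    sql = f"SELECT DISTINCT {first_seen[select]} FROM {tables}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Paris", "is-a", "City"),
        ("Paris", "located-in", "France"),
        ("Berlin", "is-a", "City"),
        ("Berlin", "located-in", "Germany"),
    ],
)

# Roughly: SELECT ?city WHERE { ?city is-a City . ?city located-in France }
sql, params = patterns_to_sql(
    [("?city", "is-a", "City"), ("?city", "located-in", "France")],
    select="?city",
)
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Each additional triple pattern adds one more self-join on the same table.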

Performance

Dedicated triple stores have advantages over an RDBMS. For example, they can use index data structures specialized for triple data. However, an RDBMS also has advantages when handling complex queries.
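One commonly described kind of triple-specific index keeps the same triples under several orderings of subject (S), predicate (P), and object (O), so that a pattern with any two positions bound becomes a direct lookup. The toy in-memory class below is an assumption-laden sketch of that idea with three orderings (SPO, POS, OSP); it is not the design of any particular system named in these notes.

```python
from collections import defaultdict


class TripleIndex:
    """Toy in-memory triple store with three permutation indexes."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        """Insert one triple into all three indexes."""
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def objects(self, s, p):
        """Answer the pattern (s, p, ?o) from the SPO index."""
        return set(self.spo.get(s, {}).get(p, set()))

    def subjects(self, p, o):
        """Answer the pattern (?s, p, o) from the POS index."""
        return set(self.pos.get(p, {}).get(o, set()))

    def predicates(self, s, o):
        """Answer the pattern (s, ?p, o) from the OSP index."""
        return set(self.osp.get(o, {}).get(s, set()))


index = TripleIndex()
for triple in [
    ("Paris", "is-a", "City"),
    ("Paris", "located-in", "France"),
    ("Berlin", "is-a", "City"),
]:
    index.add(*triple)

print(index.subjects("is-a", "City"))
print(index.objects("Paris", "located-in"))
```

The cost is storing every triple three times, in exchange for lookups that do not scan the data.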

4.2.2 Systems

As of 2016, the three most widely used systems are Virtuoso, Jena, and Sesame.

4.3 Structured Data Extraction from Text

image