BrambleXu / knowledge-graph-learning

A curated list of awesome knowledge graph tutorials, projects and communities.

COLING-2012-Creating an Extended Named Entity Dictionary from Wikipedia #272

Open BrambleXu opened 4 years ago

BrambleXu commented 4 years ago

Summary:

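Previous NER work used at most 18 labels. This paper takes on 200 labels. The paper as a whole is fairly simple; most of the effort went into feature design.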

Resource:

Paper information:

Notes:

This paper does a lot of processing when constructing features. The dataset is BCCWJ, from the paper Comparison of Annotating Methods for Named Entity Corpora.

3 Proposed Method

The method has three steps: creating training data, extracting features, and training a classifier.

3.1 Creating training data

The method uses an existing corpus annotated with ENE types, here BCCWJ, as the seed dictionary. External dictionaries/gazetteers are then intersected with Wikipedia titles to create the training data.

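  1. Extract all tagged sections from BCCWJ. In total there are 8,828 newswire articles with 255,407 tagged sections.
  2. Compare them against the titles extracted from the Wikipedia dump. If the title is identical, it counts as a successful match.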
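The two steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name, the ENE type names, and the choice to skip surface forms tagged with more than one type are all assumptions of mine.

```python
def build_training_data(seed_entries, wiki_titles):
    """Label Wikipedia titles that exactly match a seed-dictionary surface form.

    seed_entries: iterable of (surface_string, ene_type) pairs taken from the
    tagged sections of the annotated corpus.
    wiki_titles: iterable of article titles from a Wikipedia dump.
    """
    titles = set(wiki_titles)

    # Collect every type seen for each surface string.
    types_by_surface = {}
    for surface, ene_type in seed_entries:
        types_by_surface.setdefault(surface, set()).add(ene_type)

    training = {}
    for surface, types in types_by_surface.items():
        # Assumption: ambiguous surfaces (more than one type) are skipped.
        if surface in titles and len(types) == 1:
            training[surface] = next(iter(types))
    return training


# Hypothetical toy data; the type names are illustrative only.
seed = [("東京", "City"), ("富士山", "Mountain"),
        ("さくら", "Flora"), ("さくら", "Person")]
titles = ["東京", "富士山", "さくら", "大阪"]
training = build_training_data(seed, titles)
print(training)
```

Exact string equality is the only matching rule here, which mirrors step 2; any normalization of titles would be an extra design decision.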

3.2 Feature Extraction

3.2.1 Features related to the surface string of the title of an article

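For example, "市" indicates that the title is a city name, and nouns such as "川" and "山" may refer to a river or a mountain. There are 16 features in this group; T stands for title.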
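As a rough sketch of the suffix idea only (the paper's actual 16 T features are not reproduced here), the last one or two characters of a title can be turned into sparse binary features. The feature names below are hypothetical.

```python
def title_surface_features(title, max_n=2):
    """Return sparse binary features from the trailing characters of a title."""
    feats = {}
    for n in range(1, max_n + 1):
        # Only emit an n-character suffix if the title is long enough.
        if len(title) >= n:
            feats[f"T_suffix{n}={title[-n:]}"] = 1
    return feats


city_feats = title_surface_features("横浜市")
river_feats = title_surface_features("川")
print(city_feats)
print(river_feats)
```

A title ending in "市" fires `T_suffix1=市`, which a classifier can then learn to associate with city types.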

3.2.2 Features related to the content of an article

These features come from the first sentence of the article and its headings (section titles).
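A minimal sketch of pulling these two sources out of raw wikitext, under two assumptions of mine: headings follow the usual `== Heading ==` MediaWiki syntax, and character bigrams stand in for proper Japanese morphological analysis. The feature names are hypothetical.

```python
import re

# Matches MediaWiki headings such as "== 歴史 ==" or "=== 概要 ===".
HEADING_RE = re.compile(r"^(={2,})[ \t]*(.*?)[ \t]*\1[ \t]*$", re.MULTILINE)


def content_features(wikitext):
    """Sparse features from the first sentence and the section headings."""
    headings = [m.group(2) for m in HEADING_RE.finditer(wikitext)]

    # The lead is everything before the first heading.
    first_heading = HEADING_RE.search(wikitext)
    lead = wikitext[:first_heading.start()] if first_heading else wikitext
    first_sentence = lead.strip().split("。")[0]

    feats = {f"C_heading={h}": 1 for h in headings}
    # Assumption: character bigrams replace a real tokenizer.
    for i in range(len(first_sentence) - 1):
        feats[f"C_first_bigram={first_sentence[i:i + 2]}"] = 1
    return feats


sample = ("横浜市は、神奈川県の都市である。人口は多い。\n"
          "== 歴史 ==\n本文。\n"
          "== 地理 ==\n本文。")
content_feats = content_features(sample)
print(sorted(content_feats))
```

Only the first sentence contributes bigrams, so words from later sentences (such as "人口" above) do not appear in the output.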

3.2.3 Features related to the meta data of an article

(M21) Infobox attributes: Infoboxes provide tabular data for articles. Since the attributes (attribute names) of a table generally indicate the attributes of the entity in question, in a similar manner to (Saleh et al., 2010), we use the infobox attributes to create our features. For each attribute in the infobox, we create a bag-of-words feature indicating the existence of that attribute. (My note: this could be replaced with KG embeddings.)
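A small sketch of the M21 idea, assuming the infobox is available as raw wikitext with one `| name = value` pair per line. The regular expression and the feature-name prefix are my own choices, not something taken from the paper.

```python
import re

# Captures the attribute name in lines like "| population = 3770000".
INFOBOX_ATTR_RE = re.compile(r"^\s*\|\s*([^=|{}\n]+?)\s*=", re.MULTILINE)


def infobox_features(infobox_wikitext):
    """One binary feature per attribute name present in the infobox."""
    names = INFOBOX_ATTR_RE.findall(infobox_wikitext)
    return {f"M21_attr={name}": 1 for name in names}


infobox = ("{{Infobox settlement\n"
           "| name = 横浜市\n"
           "| population = 3770000\n"
           "| area = 437.4\n"
           "}}")
infobox_feats = infobox_features(infobox)
print(infobox_feats)
```

Only the attribute names are used; the values are ignored, which matches the "existence of that attribute" wording in the quoted passage.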

3.3 Training a classifier

We chose logistic regression because of its capability to output probability estimates.
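To make the "probability estimates" point concrete, here is a toy multinomial (softmax) logistic regression over sparse features in plain Python. It is only a sketch: the paper does not say, in these notes, which formulation or library was used, and the training loop, learning rate, lack of regularization, and class names below are all assumptions.

```python
import math


def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}


def predict_proba(weights, feats):
    """Class probabilities for one sparse feature dict."""
    scores = {c: sum(w.get(f, 0.0) * v for f, v in feats.items())
              for c, w in weights.items()}
    return softmax(scores)


def train(examples, classes, epochs=200, lr=0.5):
    """Plain stochastic gradient ascent on the log-likelihood."""
    weights = {c: {} for c in classes}
    for _ in range(epochs):
        for feats, gold in examples:
            probs = predict_proba(weights, feats)
            for c in classes:
                grad = (1.0 if c == gold else 0.0) - probs[c]
                for f, v in feats.items():
                    weights[c][f] = weights[c].get(f, 0.0) + lr * grad * v
    return weights


# Hypothetical toy examples built from title-suffix features.
examples = [
    ({"T_suffix1=市": 1}, "City"),
    ({"T_suffix1=山": 1}, "Mountain"),
    ({"T_suffix1=川": 1}, "River"),
]
weights = train(examples, ["City", "Mountain", "River"])
probs = predict_proba(weights, {"T_suffix1=市": 1})
print(probs)
```

Because the output is a full distribution over types rather than a single label, low-confidence predictions can be thresholded or ranked when building the dictionary.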

Result:

4.1 Results by individual features

[image: results by individual features]

4.2 Results by the combination of features

[image: results by the combination of features]

The best-performing combination uses the following 7 features.

Thoughts:

Next Reading: