3 Proposed Method

The method has three steps: creating training data, extracting features, and training a classifier.

3.1 Creating training data

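The method uses an existing corpus annotated with ENE (Extended Named Entity) types: BCCWJ, which serves as the seed dictionary. External dictionaries/gazetteers are then intersected with Wikipedia article titles to create the training data.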

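All tagged sections are extracted from BCCWJ: 8,828 newswire articles with 255,407 tagged sections in total.
These are compared against the titles extracted from a Wikipedia dump; if a tagged section is identical to a title, it counts as a successful match.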
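The matching step can be sketched as a simple dictionary intersection. This is a minimal illustration, not the authors' code: the function name `build_training_data`, the `(surface, ene_type)` input format, the example type labels, and the decision to drop titles that match more than one type are all assumptions made here.

```python
from collections import defaultdict


def build_training_data(tagged_sections, wiki_titles):
    """Label Wikipedia titles with ENE types via exact string match.

    tagged_sections: iterable of (surface_string, ene_type) pairs taken
        from the annotated corpus (hypothetical input format).
    wiki_titles: iterable of article titles from a Wikipedia dump.
    """
    # Collect every ENE type observed for each surface string.
    types_by_surface = defaultdict(set)
    for surface, ene_type in tagged_sections:
        types_by_surface[surface].add(ene_type)

    labeled = {}
    for title in wiki_titles:
        types = types_by_surface.get(title)
        # Assumption: keep only unambiguous matches (exactly one type).
        if types and len(types) == 1:
            labeled[title] = next(iter(types))
    return labeled


# Illustrative data only; the type labels are placeholders.
sections = [("東京都", "Province"), ("富士山", "Mountain"),
            ("富士山", "Mountain"), ("日本", "Country"), ("日本", "GPE")]
titles = ["東京都", "富士山", "日本", "ロンドン"]
print(build_training_data(sections, titles))
```

Here "日本" is dropped because it carries two different types, and "ロンドン" is dropped because it never appears in the seed dictionary. Whether ambiguous titles should be dropped or kept with multiple labels is a design choice the notes do not specify.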

3.2 Feature Extraction

3.2.1 Features related to the surface string of the title of an article

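For example, “市” suggests that the title is the name of a city, and nouns such as “川” or “山” may indicate a river or a mountain. There are 16 features of this kind in total; T stands for title.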
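A few of the surface-string features can be sketched as follows. This covers only character unigrams/bigrams and the last one or two characters; the function name `surface_features` and the feature-name prefixes (`uni=`, `bi=`, `last1=`, `last2=`) are assumptions for illustration, and features that need a morphological analyzer (such as POS tags) are left out.

```python
def surface_features(title):
    """Return a set of binary feature names derived from the title string."""
    feats = set()
    # Character unigrams.
    for ch in title:
        feats.add("uni=" + ch)
    # Character bigrams over adjacent characters.
    for a, b in zip(title, title[1:]):
        feats.add("bi=" + a + b)
    # Last one and last two characters, e.g. "市" at the end of a city name.
    if title:
        feats.add("last1=" + title[-1])
        feats.add("last2=" + title[-2:])
    return feats


print(sorted(surface_features("横浜市")))
```

Representing each feature as a string in a set keeps the representation sparse: only the features that fire for a given title are stored.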

3.2.2 Features related to the content of an article

These features are taken from the first sentence and the headings (section titles) of an article.

(C17) Last common noun or noun/counter suffix of the first sentence: It is widely known that the first sentence of an article in Wikipedia is a definition statement.
(C18) Headings: Headings or section titles summarize what is written in an article.

3.2.3 Features related to the meta data of an article

(M21) Infobox attributes: Infoboxes provide tabular data for articles. Since the attributes (attribute names) of a table generally indicate the attributes of the entity in question, in a similar manner to (Saleh et al., 2010), we use the infobox attributes to create our features. For each attribute in the infobox, we create a bag-of-words feature indicating the existence of that attribute. (Note: this could be replaced with KG embeddings.)
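The infobox feature can be sketched as a set of binary indicators, one per attribute name. This is a minimal sketch assuming the infobox has already been parsed into a dict of attribute name to value; the names `infobox_features` and `to_binary_vector`, the `attr=` prefix, and the example infoboxes are hypothetical.

```python
def infobox_features(infobox):
    """Return one binary feature per attribute name; values are ignored.

    infobox: dict mapping attribute name -> value (assumed parsed form).
    """
    return {"attr=" + name.strip() for name in infobox if name.strip()}


def to_binary_vector(features, vocabulary):
    """Map a feature set onto a fixed vocabulary as a 0/1 list."""
    return [1 if v in features else 0 for v in vocabulary]


# Illustrative infoboxes; attribute values are placeholders.
city = {"人口": "(value)", "面積": "(value)", "市長": "(value)"}
company = {"設立": "(value)", "資本金": "(value)", "従業員数": "(value)"}

vocabulary = sorted(infobox_features(city) | infobox_features(company))
print(to_binary_vector(infobox_features(city), vocabulary))
```

Only the presence of an attribute matters here, which matches the "existence of that attribute" description; attribute values are not used.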

3.3 Training a classifier

Logistic regression was chosen because of its capability to output probability estimates.
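To illustrate what "probability estimates" means here, the following is a minimal pure-Python sketch of logistic regression over sparse binary features, trained with stochastic gradient descent. It is a simplification under stated assumptions: it is binary (one type versus the rest) rather than multi-class, it uses no regularization, and the function names, learning rate, and toy examples are all invented for illustration. It says nothing about the library or settings actually used in the paper.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(examples, epochs=200, lr=0.5):
    """examples: list of (set_of_feature_names, label), label in {0, 1}."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for feats, label in examples:
            p = sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))
            grad = p - label  # gradient of the log loss w.r.t. the score
            bias -= lr * grad
            for f in feats:
                weights[f] = weights.get(f, 0.0) - lr * grad
    return weights, bias


def predict_proba(model, feats):
    """Probability that the example belongs to the positive class."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))


# Toy data (assumption): 1 = "city", 0 = "not a city".
examples = [
    ({"last1=市"}, 1),
    ({"last1=山"}, 0),
    ({"last1=市", "attr=人口"}, 1),
    ({"last1=川"}, 0),
]
model = train(examples)
print(predict_proba(model, {"last1=市"}))
print(predict_proba(model, {"last1=山"}))
```

Because the output is a probability rather than a hard label, a threshold can be applied afterwards, for example to keep only high-confidence dictionary entries.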

Results:

4.1 Results by individual features

4.2 Results by the combination of features

The best-performing combination consists of the following 7 features.

(T3-T4) Character unigram/bigram
(T5) POS unigram
(T10) Last two character(s)
(T13) Proper noun semantic categories: JTAG outputs special semantic categories for proper nouns.
(T16) Character type construction: In addition to the last character type, this feature focuses on how character types constitute a title. For example, “London” is written with four consecutive Katakana characters in Japanese. Therefore, we have “K-K-K-K” (K stands for Katakana) as a binary feature. Note: to apply this feature to company names, English letters also need to be handled. In that case, H would stand for Kanji, K for Katakana, and E for English letters.
(M21) Infobox attributes: as described in 3.2.3. We used the Japanese version of DBPedia (http://ja.dbpedia.org/), which offers the infobox data for Japanese articles as triples.
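The character type construction feature (T16), extended to company names as suggested in the note, can be sketched as below. The H/K/E labels follow the note; the extra "O" label for everything else (Hiragana, digits, symbols, full-width Latin letters), the Unicode ranges chosen, and the function names are assumptions of this sketch.

```python
def char_type(ch):
    """Map one character to a coarse character-type label."""
    code = ord(ch)
    if 0x30A0 <= code <= 0x30FF:
        return "K"  # Katakana block (includes the long-vowel mark "ー")
    if 0x4E00 <= code <= 0x9FFF:
        return "H"  # CJK Unified Ideographs (Kanji)
    if ch.isascii() and ch.isalpha():
        return "E"  # ASCII English letters
    return "O"      # anything else (assumption: a single catch-all label)


def char_type_pattern(title):
    """Join the per-character labels into one pattern string."""
    return "-".join(char_type(ch) for ch in title)


print(char_type_pattern("ロンドン"))     # K-K-K-K
print(char_type_pattern("NTTドコモ"))    # E-E-E-K-K-K
print(char_type_pattern("三菱UFJ銀行"))  # H-H-E-E-E-H-H
```

The resulting pattern string is used as a single binary feature. Long titles produce long, rare patterns, so collapsing runs of the same type (for example "E-K" instead of "E-E-E-K-K-K") might generalize better; that variant is not described in the notes.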

Thoughts:

Next Reading:


COLING-2012-Creating an Extended Named Entity Dictionary from Wikipedia #272