BrambleXu / knowledge-graph-learning

A curated list of awesome knowledge graph tutorials, projects and communities.
MIT License

TKDE(J)-2018-Automated phrase mining from massive text corpora #276

Open BrambleXu opened 4 years ago

BrambleXu commented 4 years ago

Summary:

AutoPhrase automates phrase mining by obtaining a large number of quality phrases from public KBs. It can support any language. The user needs to provide two kinds of data: a general knowledge base together with a pre-trained POS tagger.

Resource:

Paper information:

Notes:

Two main techniques:

  1. Robust Positive-Only Distant Training. High-quality phrases are collected from a general KB. We independently build samples of positive labels from general knowledge bases and negative labels from the given domain corpora, and train a number of base classifiers. We then aggregate the predictions from these classifiers, whose independence helps reduce the noise from negative labels.
  2. POS-Guided Phrasal Segmentation. This step depends on the language; if the language of the corpus is not even known, the final results may be relatively poor.

The paper assumes two inputs are required: a general knowledge base together with a pre-trained POS tagger.
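
The first technique listed above (sample positives from a KB and noisy negatives from the corpus, train several base classifiers independently, then aggregate their predictions) can be illustrated with a minimal sketch. This is not the paper's implementation: the nearest-centroid base classifier is a stand-in chosen for brevity, and the two-dimensional feature vectors, pool contents, and function names are all hypothetical.

```python
import random
from statistics import mean

def centroid(rows):
    """Component-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_base(pos_pool, neg_pool, k, rng):
    """Train one base classifier on k sampled positives and k sampled noisy negatives."""
    pos = rng.sample(pos_pool, k)
    neg = rng.sample(neg_pool, k)
    return centroid(pos), centroid(neg)

def predict_base(model, x):
    """Vote 1.0 if x is closer to the positive centroid, else 0.0."""
    pos_c, neg_c = model
    return 1.0 if sq_dist(x, pos_c) < sq_dist(x, neg_c) else 0.0

def train_ensemble(pos_pool, neg_pool, n_classifiers=50, k=4, seed=0):
    """Train base classifiers independently, each on its own random sample."""
    rng = random.Random(seed)
    return [train_base(pos_pool, neg_pool, k, rng) for _ in range(n_classifiers)]

def phrase_quality(ensemble, x):
    """Aggregate the votes: the fraction of base classifiers that accept x."""
    return mean(predict_base(m, x) for m in ensemble)

# Hypothetical features for phrase candidates that match a KB entry (positives).
pos_pool = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.7],
            [0.7, 0.85], [0.95, 0.9], [0.75, 0.8]]

# Candidates from the corpus that are absent from the KB (noisy negatives).
# [0.9, 0.85] stands for a good phrase the KB happens to miss.
neg_pool = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.2], [0.25, 0.15],
            [0.1, 0.1], [0.2, 0.25], [0.05, 0.2], [0.9, 0.85], [0.3, 0.3]]

ensemble = train_ensemble(pos_pool, neg_pool)
good_score = phrase_quality(ensemble, [0.9, 0.9])
bad_score = phrase_quality(ensemble, [0.1, 0.1])
print(good_score, bad_score)
```

Because each base classifier sees only a small random sample, a mislabeled negative influences only some of them, and averaging the votes dilutes its effect.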

Model Graph:

image

We propose a robust positive-only distant training to minimize the human effort and develop a POS-guided phrasal segmentation model to improve the model performance.

A phrase is defined as a sequence of words that appear consecutively in the text, forming a complete semantic unit in certain contexts of the given documents [8]. The phrase quality is defined to be the probability of a word sequence being a complete semantic unit, meeting the following criteria:

Only phrases that satisfy all of the above criteria are considered quality phrases.

image

image

Result:

Thoughts:

Build a Japanese business KB from wiki.

Next Reading: