Summary:
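The paper uses a semi-supervised approach to extract complex business relationships from text, requiring only a small number of manually specified company pairs. It also provides a method for determining the direction of asymmetric relationships such as "ownership of". Put plainly, it extends the Snowball system so that it can determine the direction of asymmetric relationships.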
Resource:
Paper information:
Notes:
1 Business Networks
In this paper, we regard the problem of extracting relationships of several specific types among companies from news articles.
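The second paragraph gives a good example. Tasks such as building a business network graph, analyzing risk, and evaluating companies all need information about the relationships between companies. For example, Dell wants to acquire EMC and goes to a bank for a loan. The bank has to assess the risk before deciding whether to lend, and this is where business relationships come into play. By analyzing the business networks of the two companies, the bank might find that many of EMC's subsidiaries have weak revenue, conclude that lending to Dell is too risky, and decide to lend only a little or nothing at all.
To build a company graph, relationship extraction is essential, for relationships such as ownership of, partnership with, supplier of, and so on. Freebase and infoboxes contain only the major subsidiaries of some companies (the ownership relationship); other relationships, such as partnership with and supplier of, are largely absent.
So, given a text, we want to:
find out whether two co-occurring companies have a business relationship
determine which business relationship it is
if it is an asymmetric business relationship, find its direction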
The Snowball system can solve the general RE problem, and this paper builds on Snowball's general idea: use a small set of entity pairs as a seed set, generate candidate patterns from the context, use a scoring function to select the most important patterns, and use those patterns to extract new entity pairs. Finally, the newly selected pairs are added to the seed set, which yields more patterns. However, Snowball's scoring is only correct for one-to-many relations, such as the headquarter of (Microsoft, Redmond) relationship: Microsoft has exactly one headquarter. Business relationships do not adhere to this characteristic, which is the reason Snowball is unable to solve the problem at hand.
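To see why a one-to-many assumption hurts here, consider a minimal sketch of a Snowball-style pattern confidence, positive / (positive + negative). It assumes that an extracted pair counts as negative whenever the seed set already lists a different partner for the same first entity. The function name and the DoubleClick pair are illustrative and are not taken from the paper.

```python
def snowball_confidence(extracted, seeds):
    """Confidence of one pattern: positive / (positive + negative).

    Assumption: a pair (a, b) is positive if it is a seed, negative if the
    seeds list only other partners for a, and ignored if a is unknown.
    """
    partners = {}
    for a, b in seeds:
        partners.setdefault(a, set()).add(b)
    positive = negative = 0
    for a, b in extracted:
        if a not in partners:
            continue  # nothing known about a, so the pair cannot be judged
        if b in partners[a]:
            positive += 1
        else:
            negative += 1
    judged = positive + negative
    return positive / judged if judged else 0.0


# headquarter_of: a second headquarter for Microsoft is a real error.
hq = snowball_confidence(
    [("Microsoft", "Redmond"), ("Microsoft", "Seattle")],
    {("Microsoft", "Redmond")},
)

# ownership_of: a second subsidiary for Google is plausible, yet it is
# penalized in exactly the same way.
own = snowball_confidence(
    [("Google", "YouTube"), ("Google", "DoubleClick")],
    {("Google", "YouTube")},
)
print(hq, own)  # 0.5 0.5
```

Both patterns receive the same score although only the first one made a mistake, which is the kind of mismatch that calls for a different selection strategy for many-to-many relationships.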
The authors extend Snowball with a key-phrase extraction strategy that removes the unimportant parts of the context between company pairs. To determine the direction of asymmetric relationships, they propose a method that exploits information already contained in the seed set. Because Snowball cannot handle many-to-many relationships, they add a new seed set for selecting the corresponding patterns. They also propose a holistic pattern identification strategy.
The paper's own summary: "In summary, we propose a system to perform (directed) relationship extraction (RE) between companies from textual data. Addressing this problem, we present a novel, semi-supervised relationship extraction method, which requires only a minimum amount of manually specified company pairs to efficiently extract new ones that belong to the same target relationship. Additionally, we provide a straightforward solution to reliably identify the direction of asymmetric relationships. We show that our approach is superior to more advanced distant learning approaches for the particularly difficult case of many-to-many relationships."
2 Background and Related Work
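The main motivation is to keep manual annotation to a minimum. This section introduces Snowball as a bootstrapping strategy; representative systems are Snowball [1] and StatSnowball [14].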
3 Overview of our Approach
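The input consists of textual data and a seed set. Patterns are generated from the contexts of disambiguated company mentions: if a company pair from the seed set co-occurs in a sentence, the context of that sentence probably expresses the seed set's relationship (Sec. 4.1). Sentences containing more than one company are therefore selected as input for the RE stage. An example:
. . . [[Verizon Communications|Verizon]]’s acquisition of [[MCI Inc.|MCI]]
Here the mentions "Verizon" and "MCI" are linked to the entities "Verizon Communications" and "MCI Inc.", respectively. From the contexts surrounding company pairs, candidate patterns that may express the target relationship are extracted (Sec. 4.1). Suppose the target relationship is ownership_of and the company pair (Verizon Communications, MCI Inc.) is in the seed set; we then obtain the candidate pattern <COMP1, COMP2, acquisition of, →>, where the final arrow indicates the direction of the relationship. A set of candidate patterns is generated in this way, and the best ones are selected using the measure introduced in Sec. 4.2.
The selected patterns are then used to find relationships between new company pairs in the input. For example, with the pattern "acquisition of", the pair (The Walt Disney Company, Pixar) can be extracted from ". . . after Disney’s acquisition of Pixar Animation Studios". These new company pairs are added to the seed set, and the steps above are repeated, generating more patterns from the extended seed set until no new company pairs are added or the iteration limit is reached. Company pairs extracted by the current patterns are assumed to hold the same relationship. The evaluation shows that the final set of pairs is largely the same regardless of which seed pairs are chosen initially.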
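The loop described in this section can be sketched as follows. This is a simplified illustration, not the authors' implementation: pattern selection (Sec. 4.2) is omitted, key-phrase extraction is replaced by a crude placeholder, and the direction is assumed to be "->" when the seed pair appears in the same order as in the text and "<-" otherwise. All names (`Pattern`, `linked_pairs`, `bootstrap`, and so on) are hypothetical.

```python
import re
from typing import NamedTuple

# [[Entity|mention]] markup, as in the Verizon example.
LINK = re.compile(r"\[\[([^\]|]+)\|([^\]]+)\]\]")


class Pattern(NamedTuple):
    key_phrase: str
    direction: str  # "->": first company relates to second; "<-": reversed


def linked_pairs(sentence):
    """Yield (entity1, entity2, text between them) for adjacent mentions."""
    links = list(LINK.finditer(sentence))
    for left, right in zip(links, links[1:]):
        yield left.group(1), right.group(1), sentence[left.end():right.start()]


def key_phrase(context):
    """Crude stand-in for key-phrase extraction: drop a possessive marker."""
    return re.sub(r"^[’']s\b", "", context).strip(" ,.")


def generate_patterns(sentences, seeds):
    """Create a candidate pattern wherever a seed pair co-occurs."""
    patterns = set()
    for sentence in sentences:
        for e1, e2, context in linked_pairs(sentence):
            if (e1, e2) in seeds:  # assumption: seed order gives direction
                patterns.add(Pattern(key_phrase(context), "->"))
            elif (e2, e1) in seeds:
                patterns.add(Pattern(key_phrase(context), "<-"))
    return patterns


def apply_patterns(sentences, patterns):
    """Extract directed company pairs from sentences matching a pattern."""
    pairs = set()
    for sentence in sentences:
        for e1, e2, context in linked_pairs(sentence):
            phrase = key_phrase(context)
            if Pattern(phrase, "->") in patterns:
                pairs.add((e1, e2))
            if Pattern(phrase, "<-") in patterns:
                pairs.add((e2, e1))
    return pairs


def bootstrap(sentences, seeds, max_iterations=5):
    """Alternate generation and extraction until no new pair appears."""
    seeds = set(seeds)
    for _ in range(max_iterations):
        patterns = generate_patterns(sentences, seeds)  # selection omitted
        new_pairs = apply_patterns(sentences, patterns) - seeds
        if not new_pairs:
            break
        seeds |= new_pairs
    return seeds


sentences = [
    "[[Verizon Communications|Verizon]]’s acquisition of [[MCI Inc.|MCI]]",
    "after [[The Walt Disney Company|Disney]]’s acquisition of "
    "[[Pixar|Pixar Animation Studios]]",
]
result = bootstrap(sentences, {("Verizon Communications", "MCI Inc.")})
```

Starting from the single seed pair, `result` also contains (The Walt Disney Company, Pixar), and the loop stops in the second iteration because no further pairs are found.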
4 Extraction of Business Relationships
4.1 Pattern generation
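To find the key information that represents a relation, we need a method that extracts from the context the phrases that best characterize the relation. The extracted phrases then serve as key-phrases for generating patterns.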
Candidate pattern
An extracted pattern contains two company variables, COMP1 and COMP2, the key-phrase extracted from the text between the two companies, and a direction. For example, from the sentence
“. . . YouTube, the video-sharing Web site owned by Google . . . ”
we can generate the pattern <COMP1, COMP2, owned by, ←>. By applying this pattern to this sentence we obtain the following instantiation of the pattern: <YouTube, Google, owned by, ←>, indicating that Google owns YouTube.
Key-phrase extraction
The quality of a pattern depends on its key-phrase. A good pattern has to meet two criteria: it should characterize a single type of relationship (which in turn improves the precision of the extraction result) and be as general as possible (to extract many new company pairs).
As expected, this is the weakness of hard-coded patterns; using knowledge graph embeddings (KGE) should generalize better.
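Based on these two criteria, overly specific key-phrases should be avoided, and key-phrases should be as compact as possible. One challenge of extracting relations from news is that journalists use different writing styles to describe the same business relationship. For example, in the sentence ". . . News Corporation, which owns a minority interest in DirecTV", we can tell that News Corporation is an owner of DirecTV by spotting the word "owns". Matching on the full wording (i.e., ", which owns a minority interest in") makes the relationship hard to find when it is phrased differently. The solution is to extract just the key-phrase "owns" to define the ownership relationship, which reduces the original sentence to "News Corporation owns DirecTV".
This simplification looks problematic: what if the relation is the reverse, owned by? Defining every case with a regex would be too tedious.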
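To perform this simplification, the most determining phrases have to be selected from the intermediary context (the text between the two entities). Intuitively, relations are mostly expressed through verbs and nouns. According to "The tradeoffs between open and traditional relation extraction" (ACL), four types of phrases cover 86% of relations: "Verb", "Noun+Prep", "Verb+Prep", and "Infinitive" constructions between the two entities. The Stanford POS tagger is used to extract key-phrases: based on the POS tags, phrases matching one of these four types are selected. Phrases whose verb is "to be" are explicitly discarded, because this verb does not indicate any relation.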
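A rough sketch of matching the four phrase types on an already tagged context is shown below. It assumes Penn Treebank tags of the kind the Stanford POS tagger produces, but the tags here are assigned by hand and the function name `candidate_key_phrases` is hypothetical. How the paper chooses among several candidates is not covered in these notes, so the sketch simply returns all of them.

```python
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def candidate_key_phrases(tagged):
    """Return Verb, Verb+Prep, Noun+Prep and infinitive phrases in order.

    `tagged` is a list of (word, tag) tuples for the text between two
    companies. Phrases built on a form of "to be" are skipped.
    """
    phrases = []
    i = 0
    while i < len(tagged):
        word, tag = tagged[i]
        next_word, next_tag = tagged[i + 1] if i + 1 < len(tagged) else (None, None)
        if tag == "TO" and next_tag == "VB":  # infinitive, e.g. "to acquire"
            if next_word.lower() not in BE_FORMS:
                phrases.append(f"{word} {next_word}")
            i += 2
            continue
        if tag in VERB_TAGS and word.lower() not in BE_FORMS:
            if next_tag == "IN":  # Verb+Prep, e.g. "owned by"
                phrases.append(f"{word} {next_word}")
                i += 2
                continue
            phrases.append(word)  # plain Verb, e.g. "owns"
        elif tag in NOUN_TAGS and next_tag == "IN":  # Noun+Prep
            phrases.append(f"{word} {next_word}")
            i += 2
            continue
        i += 1
    return phrases


# Hand-tagged contexts from the examples in these notes.
owns = candidate_key_phrases([
    (",", ","), ("which", "WDT"), ("owns", "VBZ"), ("a", "DT"),
    ("minority", "NN"), ("interest", "NN"), ("in", "IN"),
])
owned_by = candidate_key_phrases([
    (",", ","), ("the", "DT"), ("video-sharing", "JJ"), ("Web", "NNP"),
    ("site", "NN"), ("owned", "VBN"), ("by", "IN"),
])
acquisition = candidate_key_phrases([
    ("’s", "POS"), ("acquisition", "NN"), ("of", "IN"),
])
print(owns, owned_by, acquisition)
# ['owns', 'interest in'] ['owned by'] ['acquisition of']
```

The DirecTV context yields two candidates, "owns" and "interest in", which shows that a further selection step is still needed to arrive at the compact key-phrase "owns".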
4.2 Pattern selection