Summary:

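A method for making effective use of weakly labeled (WL) data. Two models are built: a classification model that is pretrained on the noisy part of the weakly labeled data, and a sequence labeling model that is trained on the high-quality part. Because the data is weakly labeled and some labels carry no entity type, the sequence labeling model uses a partial-CRF with non-entity sampling.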

Resource:

pdf
code
[paper-with-code](

Paper information:

Author:
Dataset:
keywords:

Notes:

What is low-resource: inadequate training data

What is weakly labeled data:

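To make good use of WL sentences, the paper splits them into a high-quality portion and a noisy portion, and handles each with its own module:

A classification module for the noisy data, used for pretraining so that it captures textual context semantics.
A sequence labeling module for the high-quality data, which utilizes Partial-CRFs with non-entity sampling to achieve the global optimum.

If the classification module is used for pretraining, why is it a classification model? What is its objective function? What is a Partial-CRF? What is non-entity sampling? What is the global optimum?

I did not look closely at the answers to these questions, because this is not a dictionary-based method.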

1 Introduction

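In the figure above, s2 is a WL sentence: the types behind B-NT and I-NT are unknown, so the labels are weak. Labels of this kind can be obtained in large quantities from Wikipedia.

Using WL data, however, raises the following challenges:

Partially-Labeled Sequence: because some labels carry no type, the data cannot be used for training directly. The remedy is a Partial-CRF, which assigns all possible labels to unlabeled words and maximizes the total probability (Yang et al., 2018; Shang et al., 2018 #275). However, both of these methods still need seed annotations or domain dictionaries for high-quality training.
Massive Noisy Data: WL generates a lot of noisy data with missing labels.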
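
The Partial-CRF idea from the first challenge can be sketched in a few lines. This is a minimal pure-Python sketch, not the paper's implementation: the function names, the score tables `emissions` and `transitions`, and the `allowed` sets are all assumptions made for illustration, and non-entity sampling is not modeled. Each word gets a set of permitted labels (one label where it is known, every label where it is not), and the objective is the log of the total score of all paths consistent with those sets, minus the log partition over all paths.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def constrained_log_z(emissions, transitions, allowed):
    """Log of the total score of all label paths that respect `allowed`.

    emissions[t][y]   : score of label y at position t
    transitions[a][b] : score of moving from label a to label b
    allowed[t]        : non-empty set of label indices permitted at position t
    """
    n_labels = len(transitions)
    neg_inf = float("-inf")
    # Forward algorithm in log space; forbidden labels get score -inf.
    alpha = [emissions[0][y] if y in allowed[0] else neg_inf
             for y in range(n_labels)]
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][y]
            + logsumexp([alpha[p] + transitions[p][y] for p in range(n_labels)])
            if y in allowed[t] else neg_inf
            for y in range(n_labels)
        ]
    return logsumexp(alpha)

def partial_crf_log_likelihood(emissions, transitions, allowed):
    """Log-probability mass of all paths compatible with the partial labels."""
    everything = [set(range(len(transitions)))] * len(emissions)
    return (constrained_log_z(emissions, transitions, allowed)
            - constrained_log_z(emissions, transitions, everything))
```

With a hypothetical label set such as `["O", "B-ORG", "I-ORG"]`, a word whose label is known would get a singleton set like `{1}`, and an unlabeled word would get `{0, 1, 2}`. Training would maximize this quantity with respect to the scores.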

Model Graph:

4 Weakly Labeled Data Generation

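This has two parts. The first is label induction, which assigns each word a label based on Wikipedia anchors and the taxonomy. The second is the data selection scheme, which computes a quality score for each WL sentence. Based on these scores, the dataset is split into a high-quality part and a low-quality part.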

4.1 Label Induction

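Following the anchor, look up the categories of the linked Wikipedia page. For example, Formula Shell links to Shell Turbo Chargers, and Shell Turbo Chargers is in Basketball teams. According to the Wikipedia taxonomy, Basketball teams→...→Organizations, so the mention is finally labeled Organizations. There are two problems. (1) If the linked page has no category information, the words are labeled B-NT or I-NT. (2) If the link points to multiple categories, the label is inferred by choosing the one that maximizes the conditional probability. Even so, there will still be incorrect boundaries and types of labels due to the crowdsourcing nature of the source data, so a data selection scheme is introduced to address this.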
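
The label induction steps above can be sketched roughly as follows. Everything here is a toy assumption: the `PARENT` and `TOP_LEVEL_TYPES` tables, the intermediate "Sports teams" node, the short tag name `ORG`, and the function names are invented for illustration. For anchors that resolve to several types, a simple majority vote is swapped in as a stand-in for the conditional-probability estimate, whose exact form these notes do not record.

```python
from collections import Counter

# Hypothetical toy taxonomy; the real one comes from Wikipedia categories.
PARENT = {"Basketball teams": "Sports teams", "Sports teams": "Organizations"}
TOP_LEVEL_TYPES = {"Organizations": "ORG"}

def resolve_type(category, parent=PARENT, top=TOP_LEVEL_TYPES):
    """Walk up the taxonomy until a top-level type is reached, else None."""
    seen = set()
    while category is not None and category not in seen:
        if category in top:
            return top[category]
        seen.add(category)  # guard against cycles in the category graph
        category = parent.get(category)
    return None

def induce_type(categories):
    """Pick a type for an anchor from the categories of its linked page.

    Returns "NT" (no type) when no category resolves to a known type.
    Majority vote is an assumption standing in for the paper's estimate.
    """
    votes = Counter(t for t in map(resolve_type, categories) if t is not None)
    if not votes:
        return "NT"
    return votes.most_common(1)[0][0]

def bio_tags(n_tokens, entity_type):
    """BIO tags for an anchor spanning n_tokens words."""
    return [("B-" if i == 0 else "I-") + entity_type for i in range(n_tokens)]
```

For the two-word anchor Formula Shell, `bio_tags(2, induce_type(["Basketball teams"]))` gives typed tags, while an anchor whose page has no usable category falls back to `B-NT` and `I-NT`.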

4.2 Data Selection Scheme

Following Ni et al. (2017), we compute quality scores for sentences to distinguish high-quality and noisy data from two aspects: the annotation confidence and the annotation coverage (in short: quality is judged by confidence and coverage).

The annotation confidence measures the likelihood of the text spans being mentions (i.e., correctness of boundaries) and being assigned types.
The annotation coverage measures the ratio of words in the sentence that are labeled.
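
To make the coverage idea concrete, here is a rough sketch. It is not the paper's formula: treating the tag `O` as "unlabeled", passing the confidence in as a precomputed number, and combining the two with fixed thresholds are all placeholder assumptions, as are the function names and default values.

```python
def annotation_coverage(tags):
    """Fraction of words in a sentence that carry a label other than 'O'."""
    if not tags:
        return 0.0
    return sum(tag != "O" for tag in tags) / len(tags)

def split_by_quality(sentences, min_confidence=0.5, min_coverage=0.3):
    """Split (tags, confidence) pairs into high-quality and noisy lists.

    The thresholds and the AND-combination are illustrative placeholders.
    """
    high, noisy = [], []
    for tags, confidence in sentences:
        ok = (confidence >= min_confidence
              and annotation_coverage(tags) >= min_coverage)
        (high if ok else noisy).append((tags, confidence))
    return high, noisy
```

A sentence tagged `["B-ORG", "I-ORG", "O", "O"]` has coverage 0.5 under this definition; whether it lands in the high-quality part then depends on its confidence value.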

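I need to ask someone about these two formulas; I have to understand them intuitively.

After reading this part, I feel that what I want to do is essentially to design a better data selection scheme. The scheme exists mainly to select high-quality entities; in my case it would just target the company-name domain.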

5.2 Classification Module

The classification model is a multi-label classification model.
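
As a reminder of what "multi-label" means in practice, here is a generic sketch. It assumes independent sigmoid outputs per type with a binary cross-entropy loss, which is a common choice for multi-label classification; these notes do not record the objective the paper actually uses, and the function names and threshold are hypothetical.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-type outputs."""
    eps = 1e-12  # avoids log(0)
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)))
    return sum(losses) / len(losses)

def predict_types(logits, types, threshold=0.5):
    """Every type whose probability clears the threshold is predicted."""
    return [t for z, t in zip(logits, types) if sigmoid(z) >= threshold]
```

Unlike a softmax classifier, several types can be predicted at once, for example when a sentence mentions both an organization and a location.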

Character and Word Embeddings

A word x is represented by its word embedding w together with a CNN-based character embedding c.
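
The combination of w and c can be sketched without any deep learning library. This is an illustrative toy, assuming that w and c are concatenated, that the character CNN is a single convolution layer followed by max-pooling over positions, and that words are non-empty; the function names, the zero padding, and the embedding tables are assumptions, not details taken from the paper.

```python
def char_cnn_embedding(char_vectors, filters):
    """Convolve each filter over the character vectors, then max-pool.

    char_vectors : one vector per character of a non-empty word
    filters      : each filter is a list of `width` weight vectors,
                   one per character position in the window
    Returns one pooled value per filter.
    """
    width = len(filters[0])
    dim = len(char_vectors[0])
    # Zero-pad both ends so short words still produce window positions.
    pad = [[0.0] * dim for _ in range(width - 1)]
    padded = pad + char_vectors + pad
    pooled = []
    for f in filters:
        responses = [
            sum(w * x
                for k in range(width)
                for w, x in zip(f[k], padded[i + k]))
            for i in range(len(padded) - width + 1)
        ]
        pooled.append(max(responses))
    return pooled

def word_representation(word, word_emb, char_emb, filters):
    """Concatenate the word embedding w with the character embedding c."""
    c = char_cnn_embedding([char_emb[ch] for ch in word], filters)
    return list(word_emb[word]) + c
```

The character part lets rare or unseen names share information through their spelling, while the word part carries the usual lexical signal.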

Encoder Layer

Given a sentence X of arbitrary length, this component turns X into a low-dimensional vector. A BiLSTM is used as the encoder here.

Result:

Thoughts:

Next Reading:

Yang et al., 2018: Distantly supervised NER with partial annotation learning and reinforcement learning

Partial-CRFs (PCRFs) (Täckström et al., 2013): Token and type constraints for cross-lingual part-of-speech tagging.

Ni et al. (2017) (Data Selection Scheme): Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection.


EMNLP-2019/11-Low-Resource Name Tagging Learned with Weakly Labeled Data #281