Book-2012-Natural Language Annotation for Machine Learning

Summary:

This is a book about annotation.

Resource:

pdf
[code](
[paper-with-code](

Paper information:

Author:
Dataset:
keywords:

Notes:

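I realized that my view of active learning (AL) was wrong. I had always assumed AL was a tool for annotation, but it is really a method for training a model. The goal of AL is not to annotate as much data as possible: once the model's accuracy is high enough, annotation can stop.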

Regarding Chapter 12 below: crowdsourced annotation is a trend going forward. For large amounts of data, boosting, active learning, and semi-supervised learning are three options.

Amazon’s Mechanical Turk
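- Cheap and fast, but quality is not guaranteed. Researchers have found that annotators do not reliably follow the annotation guidelines, and annotation quality varies widely from one annotator to another.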
Games with a Purpose (GWAP)
- Turns annotation into a game. The aim is to collect high-quality data, but it cannot eliminate messy data entirely.
User-Generated Content
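- MIT Open Mind Common Sense: this project asks users to volunteer common-sense facts. Because no reward is at stake, users tend to enter sensible, high-quality data (the project no longer exists).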

Handling Big Data

The earlier issues were all on the annotation side; this part is the ML side. The key strategy is to make the most of a small amount of annotated data and to exploit the properties of Big Data. (The strategy shared by all of the approaches we’ll cover in this section is to try to make the best of as little annotated (training) data as possible, and to leverage different properties of the Big Data.)

Definition of Big Data: volume, velocity, and variety.

Boosting

Uses a supervised algorithm to combine several weak learners into one strong learner, for example AdaBoost.
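To make the weak-learner-to-strong-learner idea concrete, here is a minimal AdaBoost sketch in plain Python. It is not taken from the book: the weak learner is assumed to be a one-dimensional decision stump, and the toy data and the function names (`train_stump`, `adaboost`, `predict`) are illustrative only.

```python
import math

def stump_predict(x, threshold, polarity):
    # A decision stump: one threshold test on a single number.
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified examples, lower the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (assumed): no single stump separates these labels.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single = adaboost(xs, ys, rounds=1)
ensemble = adaboost(xs, ys, rounds=3)
errors_single = sum(predict(single, x) != y for x, y in zip(xs, ys))
errors_boosted = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

Each round reweights the training examples so that the next stump concentrates on what the previous ones got wrong; the final prediction is a weighted vote over all stumps.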

Active Learning

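The algorithm itself selects data and asks the annotator to confirm the examples it needs resolved. The key point is how to automatically pose the most useful question to the annotator. (The trick is for the learning algorithm to automatically figure out how to ask the most appropriate question to the oracle (the human annotator).)
Some query-selection strategies used in AL:
- Uncertainty sampling: send the annotator the samples whose classification the model is least certain about.
- Query-by-committee: the committee is a set of hypotheses built over the data; they vote, and the votes show which parts of the data they disagree on most.
- Expected model change: choose the sample that would change the model the most if its label were known (somewhat like choosing a split node by information entropy in a decision tree).
- Expected error reduction: choose the sample that most reduces the error produced by the learner.
- Variance reduction: choose samples that reduce the noise (variance) produced by the model.
- Density-weighted methods: choose the most representative data.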
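The first two strategies in the list can be sketched in a few lines of Python. This is only an illustration under assumptions: the model's predicted class probabilities and the committee votes are given as plain lists, entropy is used as the uncertainty and disagreement measure (one common choice among several), and all names and data here are made up.

```python
import math

def entropy(probs):
    # Shannon entropy; higher means less certain.
    return -sum(p * math.log(p) for p in probs if p > 0)

def uncertainty_sampling(pool_probs):
    """Index of the unlabeled item with the most uncertain prediction."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))

def vote_entropy(votes):
    # Disagreement among committee members, measured on their label votes.
    counts = {}
    for label in votes:
        counts[label] = counts.get(label, 0) + 1
    return entropy([c / len(votes) for c in counts.values()])

def query_by_committee(pool_votes):
    """Index of the unlabeled item the committee disagrees on most."""
    return max(range(len(pool_votes)), key=lambda i: vote_entropy(pool_votes[i]))

# Assumed toy pool: one row per unlabeled item.
pool_probs = [[0.95, 0.05], [0.55, 0.45], [0.80, 0.20]]
pool_votes = [["NOUN", "NOUN", "NOUN"],
              ["NOUN", "VERB", "NOUN"],
              ["NOUN", "VERB", "ADJ"]]

next_by_uncertainty = uncertainty_sampling(pool_probs)  # item 1 (0.55 vs 0.45)
next_by_committee = query_by_committee(pool_votes)      # item 2 (three-way split)
```

In a full loop, the selected item would be sent to the annotator, added to the training set, and the model retrained; the loop stops once accuracy is good enough rather than when the pool is exhausted.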

Semi-Supervised Learning

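Uses labeled and unlabeled data together.
In some scenarios we do not know at the outset which data belong to the same class. We can first use a clustering algorithm to find representative samples, then use those as training data for a supervised model. See Example 7-5.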
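A minimal sketch of this cluster-then-label idea, assuming one-dimensional data, a tiny k-means, and a nearest-neighbor classifier as the supervised model. It is not the book's Example 7-5; the `oracle` function is a hypothetical stand-in for the human annotator, and the data is invented.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on numbers; returns the final centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[k]
                     for k, c in enumerate(clusters)]
    return centroids

def representatives(points, centroids):
    # For each cluster, the actual data point closest to its centroid.
    return [min(points, key=lambda p: abs(p - c)) for c in centroids]

def nearest_neighbor_classifier(labeled):
    def classify(x):
        return min(labeled, key=lambda pair: abs(pair[0] - x))[1]
    return classify

def oracle(x):
    # Hypothetical annotator: in practice a person labels the sample.
    return "short" if x < 3 else "long"

unlabeled = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
centroids = kmeans_1d(unlabeled, centroids=[min(unlabeled), max(unlabeled)])
reps = representatives(unlabeled, centroids)

# Only the representatives are sent for annotation.
labeled = [(x, oracle(x)) for x in reps]
classify = nearest_neighbor_classifier(labeled)
```

Only two of the eight points are annotated here, yet `classify` can label the rest of the pool; the unlabeled data contributes through the cluster structure.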

12. Afterword: The Future of Annotation

Model Graph:

Result:

Thoughts:

Next Reading:
