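MIT Open Mind Common Sense: this project asked users to volunteer commonsense facts. Because contributors had no stake in the outcome, they tended to enter sensible, high-quality data. (The project no longer exists.)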
Handling Big Data
The earlier issues were on the annotation side; this part is on the ML side. The key strategy, in the book's words: "The strategy shared by all of the approaches we'll cover in this section is to try to make the best of as little annotated (training) data as possible, and to leverage different properties of the Big Data."
The algorithm itself selects the data, then asks the annotator to confirm the items it is unsure about. In the book's words: "The trick is for the learning algorithm to automatically figure out how to ask the most appropriate question to the oracle (the human annotator)."
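The idea of letting the learner choose what to ask the oracle can be sketched as uncertainty sampling, one common query strategy for active learning. This is only a rough illustration, not the book's method: the function names, the toy texts, the scores in `toy_scores`, and the always-positive oracle are all hypothetical stand-ins for a real classifier and a real human annotator.

```python
from typing import Callable, Dict, List


def uncertainty(prob_positive: float) -> float:
    """Score that is highest when the model is least sure (p close to 0.5)."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0


def select_query(pool: List[str], predict_proba: Callable[[str], float]) -> str:
    """Pick the unlabeled item whose predicted probability is closest to 0.5."""
    return max(pool, key=lambda item: uncertainty(predict_proba(item)))


def active_learning_round(
    pool: List[str],
    predict_proba: Callable[[str], float],
    oracle: Callable[[str], int],
    labeled: Dict[str, int],
) -> str:
    """One round: choose an item, ask the oracle, move it to the labeled set."""
    item = select_query(pool, predict_proba)
    labeled[item] = oracle(item)  # the "question" put to the annotator
    pool.remove(item)
    return item


# Hypothetical model scores: P(positive) for each unlabeled text.
toy_scores = {"great movie": 0.95, "not bad": 0.52, "terrible": 0.05}
pool = list(toy_scores)
labeled: Dict[str, int] = {}

# Stand-in oracle that always answers 1; a real one would be a human annotator.
chosen = active_learning_round(pool, toy_scores.get, lambda text: 1, labeled)
print(chosen, labeled, pool)
```

In a real loop the model would be retrained on `labeled` after each round, so the scores (and therefore the next question) change as annotations arrive.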
Summary:
This is a book about annotation.
Resource:
Paper information:
Notes:
Regarding Chapter 12 below: crowdsourced annotation is a future trend. For large amounts of data, boosting, active learning, and semi-supervised learning are three approaches.
Definition of Big Data: volume, velocity, and variety.
Boosting
Active Learning
Semi-Supervised Learning
12. Afterword: The Future of Annotation
Model Graph:
Result:
Thoughts:
Next Reading: