Not all examples carry the same quality of information, and some data is going to be redundant. Identifying the best instances to train a model happens at two key times: before the model is even built and while the model is being trained. The former is called “prioritization.” The latter is called “active learning.”
Active learning is a prime example of the marriage of human and machine intelligence. Humans provide the labels that train the model (labeling faster), the model decides which labels it needs in order to improve (labeling smarter), and humans again provide those labels.
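The label, select, relabel cycle described above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the article's implementation: the "model" is a 1-D threshold, the human annotator is stood in for by a hypothetical `oracle` function, and the names `fit_threshold` and `active_learning_loop` are invented for the example. The seed set is assumed to contain at least one positive and one negative example.

```python
from typing import Callable, Dict, List


def fit_threshold(labeled: Dict[float, bool]) -> float:
    """Toy 'model': midpoint between the largest negative and smallest positive."""
    largest_negative = max(x for x, y in labeled.items() if not y)
    smallest_positive = min(x for x, y in labeled.items() if y)
    return (largest_negative + smallest_positive) / 2


def active_learning_loop(
    pool: List[float],
    oracle: Callable[[float], bool],
    seed: Dict[float, bool],
    rounds: int,
) -> Dict[float, bool]:
    """Repeat: train on current labels, let the model pick a point, ask for its label."""
    labeled = dict(seed)
    unlabeled = [x for x in pool if x not in labeled]
    for _ in range(rounds):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)  # train on the labels so far
        # The model asks for the point it is least sure about:
        # the one closest to its current decision boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        labeled[query] = oracle(query)  # the human supplies the label
        unlabeled.remove(query)
    return labeled


# Hypothetical data: points 0..10, true boundary between 6 and 7.
pool = [float(i) for i in range(11)]
result = active_learning_loop(
    pool, oracle=lambda x: x >= 7, seed={0.0: False, 10.0: True}, rounds=3
)
```

With these inputs the loop requests only three extra labels (5, 7, and 6) before the estimated boundary settles between the two classes, which is the "labeling smarter" half of the loop.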
What are the best use cases for active learning? Is it right for your project?
Settling on an active learning strategy early pays off more than building the model early.
Does your model require a lot of training data to reach the right level of accuracy?
If yes, active learning is a good choice, because fewer training examples will be needed.
Is your model prone to underfitting? Is it too simplistic?
If yes, active learning is a good choice. The examples it selects are the ones that help separate the labels, which works against underfitting.
Do you have too much data but aren’t sure which rows are the most informative?
If yes, active learning is a good choice.
Does your data have a lot of known duplicates?
If yes, active learning is a good choice. Most active learning strategies will not pick duplicate examples for labeling.
Are you running out of labeling budget but need more labels?
If yes, active learning is a good choice. It reduces the labeling budget you need.
Are your labels particularly expensive?
If yes, active learning is a good choice.
WHEN ACTIVE LEARNING MIGHT NOT BE THE RIGHT CHOICE
Below are some situations where active learning is not a good fit.
Are you using a pre-trained model?
Does your model require a small amount of data?
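Does your data have tons of features/columns? The curse of dimensionality makes the model harder to train, and active learning becomes less effective.
Is your model prone to overfitting? The model may keep selecting examples whose labels it can already predict correctly, which leads to overfitting.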
ARE THERE PARTICULAR DATA TYPES THAT WORK BEST FOR ACTIVE LEARNING?
Active learning can work for any application: NLP, computer vision, speech-to-text, video.
SOME FINAL THOUGHTS
The fact remains that active learners get better accuracy with fewer rows than generic supervised approaches. And that’s never a bad thing. Especially when it frees up a little budget for the R&D project you’ve been waiting to try.
Summary:
Introductory material on active learning from Figure Eight.
Resource:
Paper information:
Notes:
Topic: label quickly while also labeling the more representative examples.
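How active learning works
Whether to select a particular example depends on the trade-off between the cost of obtaining its label and the quality of the information it would bring.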
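One way to read this trade-off is as a net-value rule: query an example only when the information it is expected to add outweighs what its label costs. The sketch below assumes both quantities are already expressed in the same units, which is a simplification; `net_value`, `should_query`, and the candidate numbers are illustrative and not taken from the source material.

```python
from typing import Dict, List, Tuple


def net_value(expected_information: float, labeling_cost: float) -> float:
    """Difference between what an example is expected to teach and what it costs."""
    return expected_information - labeling_cost


def should_query(expected_information: float, labeling_cost: float) -> bool:
    """Query only when the example is worth more than its label costs."""
    return net_value(expected_information, labeling_cost) > 0


# Hypothetical candidates: name -> (expected information, labeling cost).
candidates: Dict[str, Tuple[float, float]] = {
    "a": (0.9, 0.2),
    "b": (0.1, 0.2),
    "c": (0.5, 0.4),
}
worth_labeling: List[str] = [
    name for name, (info, cost) in candidates.items() if should_query(info, cost)
]
```

Here `b` is skipped because its label costs more than it is expected to teach the model.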
Below are the three sampling scenarios:
Pool-based approaches
Stream-based selective sampling
Membership query synthesis scenario
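The three scenarios differ mainly in where candidate queries come from. The sketch below is a minimal illustration, assuming each unlabeled item is represented only by the model's predicted probability for the positive class; `closeness_to_boundary`, the threshold value, and the 1-D midpoint rule for synthesis are assumptions made for the example rather than anything prescribed by the source.

```python
from typing import Callable, Iterable, List, Sequence


def closeness_to_boundary(p: float) -> float:
    """Score in [0, 1]; higher means the prediction is nearer to 0.5."""
    return 1 - 2 * abs(p - 0.5)


def pool_based_query(
    pool: Sequence[float], score: Callable[[float], float], batch_size: int
) -> List[float]:
    """Pool-based: rank the whole unlabeled pool, then take the top items."""
    return sorted(pool, key=score, reverse=True)[:batch_size]


def stream_based_query(
    stream: Iterable[float], score: Callable[[float], float], threshold: float
) -> List[float]:
    """Stream-based: see items one at a time and decide on the spot."""
    return [p for p in stream if score(p) >= threshold]


def synthesize_query(largest_negative: float, smallest_positive: float) -> float:
    """Membership query synthesis (1-D toy): the learner invents its own
    example, here the midpoint between the known negative and positive."""
    return (largest_negative + smallest_positive) / 2


predictions = [0.95, 0.52, 0.10, 0.45, 0.70]
from_pool = pool_based_query(predictions, closeness_to_boundary, batch_size=2)
from_stream = stream_based_query(predictions, closeness_to_boundary, threshold=0.5)
```

The pool-based call needs the full set up front and returns a fixed-size batch, while the stream-based call never looks ahead, so the number of queries it issues depends on the chosen threshold.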
HOW DOES AN ACTIVE LEARNER DECIDE WHICH ROW TO LABEL?
The strategies below all apply to the pool-based approach: how do we choose the next item to label?
Uncertainty sampling
Query by Committee (QBC)
Expected impact
Density-weighted methods
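To make the first two strategies concrete, here is a minimal sketch of common scoring functions: entropy and least confidence for uncertainty sampling, and vote entropy for Query by Committee. It assumes the model exposes class probabilities and that committee members return hard label votes; the function names are chosen for this example. Expected impact and density-weighted methods are not covered in this block.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Uncertainty sampling: higher entropy means a less certain prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def least_confidence(probs: Sequence[float]) -> float:
    """Uncertainty sampling: one minus the probability of the top class."""
    return 1 - max(probs)


def vote_entropy(votes: Sequence[str]) -> float:
    """Query by Committee: disagreement among the members' predicted labels."""
    total = len(votes)
    return -sum(
        (count / total) * math.log(count / total)
        for count in Counter(votes).values()
    )


def most_uncertain(pool_probs: List[Sequence[float]]) -> int:
    """Index of the pool item whose predicted distribution has the highest entropy."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))


# Hypothetical class probabilities for three unlabeled items.
pool_probs = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
next_index = most_uncertain(pool_probs)
```

In this example the second item is chosen, since its prediction is closest to an even split between the two classes.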
The strategies above share common components, and the different strategies can in fact reinforce one another.
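As one example of strategies reinforcing each other, an uncertainty score can be multiplied by a density term so that uncertain outliers are not favored over uncertain but representative items. The sketch below follows the general idea of density weighting and is an assumption-laden illustration: the similarity matrix, the `beta` exponent, and the function name are hypothetical, and the average similarity here includes each item's similarity to itself.

```python
from typing import List, Sequence


def density_weighted_scores(
    uncertainty: Sequence[float],
    similarity: Sequence[Sequence[float]],
    beta: float = 1.0,
) -> List[float]:
    """Scale each item's uncertainty by its average similarity to the pool."""
    n = len(uncertainty)
    scores = []
    for i in range(n):
        average_similarity = sum(similarity[i]) / n
        scores.append(uncertainty[i] * average_similarity ** beta)
    return scores


# Hypothetical pool: item 0 is the most uncertain but is an outlier.
uncertainty = [0.9, 0.8, 0.3]
similarity = [
    [1.0, 0.1, 0.1],
    [0.1, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]
scores = density_weighted_scores(uncertainty, similarity)
by_uncertainty = max(range(len(uncertainty)), key=lambda i: uncertainty[i])
by_combined = max(range(len(scores)), key=lambda i: scores[i])
```

Plain uncertainty would pick the outlier (item 0), while the combined score prefers item 1, which is nearly as uncertain and much closer to the rest of the pool.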