Mephisto405 / Learning-Loss-for-Active-Learning

Reproducing experimental results of LL4AL [Yoo et al. 2019 CVPR]

Why do you randomly sample 10000 unlabeled data points first? #4

Closed · GWwangshuo closed 4 years ago

GWwangshuo commented 4 years ago

Thanks for your implementation. I tried running your code and noticed that in main.py you first shuffle the unlabeled set and select 10000 unlabeled data points, rather than using the whole unlabeled pool.

# main.py
# Randomly sample 10000 unlabeled data points
random.shuffle(unlabeled_set)
subset = unlabeled_set[:SUBSET]

My understanding is that the sampling should be done over the whole unlabeled pool rather than over only a part of it. Am I correct? Do you do this just for faster training, or is there something behind this operation? Have you tried selecting samples from the whole pool? Does that give similar performance? Thanks.

Mephisto405 commented 4 years ago

Hello~!

Please look over Section 4.1, Image Classification, especially the Dataset paragraph: "As studied, selecting K-most uncertain samples from such a large pool often does not work well, ... We adopt this simple yet efficient scheme and set the subset size to M=10000"
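
In code, the acquisition step the paper describes looks roughly like the sketch below. This is a minimal sketch, not this repo's exact main.py: predict_losses is a hypothetical stand-in for the loss prediction module, K stands for the per-cycle labeling budget, and the index sets are plain Python lists.

import random

SUBSET = 10000  # the paper's subset size M
K = 1000        # samples labeled per active-learning cycle (assumed budget)

def acquire(labeled_set, unlabeled_set, predict_losses):
    # 1. Draw a random subset of the unlabeled pool (the step asked about).
    random.shuffle(unlabeled_set)
    subset = unlabeled_set[:SUBSET]

    # 2. Score only this subset and take the K samples with the highest
    #    predicted loss, i.e. the most "uncertain" ones.
    scores = predict_losses(subset)
    ranked = sorted(zip(scores, subset), reverse=True)
    picked = [idx for _, idx in ranked[:K]]

    # 3. Move the picked samples into the labeled set.
    picked_set = set(picked)
    labeled_set = labeled_set + picked
    unlabeled_set = [idx for idx in unlabeled_set if idx not in picked_set]
    return labeled_set, unlabeled_set

So each cycle ranks only the M=10000 subset instead of the full pool, which is both cheaper and, per the quoted paragraph, works better than taking the K most uncertain samples from the entire pool.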

GWwangshuo commented 4 years ago

That makes sense. It turns out that selecting the K most uncertain samples gives worse performance. Thanks for the clarification. Moreover, in my experiments, the phenomenon that sampling by the ground-truth loss performs worse than sampling by the learned loss also seems related to this.
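
For reference, "sampling by the ground-truth loss" would be an oracle-style diagnostic along these lines. This is a minimal sketch assuming a standard PyTorch classifier: model, device, and the data loader are assumptions, and the true labels are only available here because it is a controlled experiment.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ground_truth_losses(model, loader, device="cuda"):
    # Per-sample (not batch-averaged) cross-entropy against the true labels.
    model.eval()
    losses = []
    for inputs, labels in loader:
        logits = model(inputs.to(device))
        losses.append(F.cross_entropy(logits, labels.to(device),
                                      reduction="none").cpu())
    return torch.cat(losses)

# Ranking the pool by these oracle losses and taking the top K is the
# "ground-truth loss" acquisition being compared against the learned loss.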