dcai-course / dcai-lab

Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024 👩🏽‍💻
https://dcai.csail.mit.edu/
GNU Affero General Public License v3.0

Already labeled data is labeled again in every iteration in "Growing Datasets" #9

Open Alx-Wo opened 1 year ago

Alx-Wo commented 1 year ago

Hi, I'm just doing this course out of personal interest. In

def passive_selection(x, labeled, label_func, n):
    candidates = set(range(0, len(x))) - set(labeled)
    labeled = np.concatenate([labeled, random.sample(list(candidates), n)])
    labels = [label_func(example) for example in x[labeled]]
    return labeled, labels

and

def active_selection(x, labeled, label_func, n):
    labels = [label_func(example) for example in x[labeled]]
    candidates = set(np.arange(len(x))) - set(labeled)
    # YOUR CODE HERE
    pass

Both functions apply label_func to every index in labeled, so examples that were already labeled are re-labeled in every iteration. Is there a reason for re-labeling them each time? It does not really matter here, since label_func is O(1) per example, but I assume this would be very bad in practice?
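For illustration, here is a minimal sketch of what labeling only the newly selected examples could look like. It is not the lab's code: the function name `passive_selection_incremental` and the extra `labels` parameter (the labels collected so far, in the same order as `labeled`) are assumptions made for this example.

```python
import random

import numpy as np


def passive_selection_incremental(x, labeled, labels, label_func, n):
    """Hypothetical variant of passive_selection that labels only new samples.

    `labels` is an assumed extra argument: the labels already collected,
    aligned with the indices in `labeled`.
    """
    already = set(int(i) for i in labeled)
    # Sort the candidates so sampling is reproducible under a fixed seed.
    candidates = sorted(set(range(len(x))) - already)
    new_idx = np.array(random.sample(candidates, n), dtype=int)
    # Call label_func only on the newly selected examples.
    new_labels = [label_func(example) for example in x[new_idx]]
    labeled = np.concatenate([np.asarray(labeled, dtype=int), new_idx])
    return labeled, list(labels) + new_labels
```

With this shape, each iteration costs n calls to label_func instead of one call per labeled example, at the price of the caller having to carry `labels` between iterations.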