eth-easl / modyn

Modyn is a research-platform for training ML models on growing datasets.
MIT License

FreshnessSamplingStrategy: Add strategy to prioritize newer data #125

Open MaxiBoether opened 1 year ago

MaxiBoether commented 1 year ago

Right now, the FreshnessSamplingStrategy without resets and with a limit enforces the limit either by sampling uniformly at random or by just taking the newest data. We should probably add a simple sampling strategy that mixes the two, e.g., by assigning higher weights to newer data.

francescodeaglio commented 1 year ago

We can sample from a distribution like $$P(sample[i] = 1) = \frac{\alpha}{n} + i \cdot\frac{2(1-\alpha)}{n(n-1)}$$ where $\alpha$ is the mixing constant (between uniform and fresh-first), $n$ is the number of points, and $i \in \{0, \dots, n-1\}$ is the sample index; the remaining terms normalize the weights so that they form a distribution. A higher index means a newer sample.
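As a quick check that the remaining terms do normalize the weights, summing the expression over $i = 0, \dots, n-1$ (the same range the demo code iterates over) gives

$$\sum_{i=0}^{n-1}\left(\frac{\alpha}{n} + i \cdot\frac{2(1-\alpha)}{n(n-1)}\right) = \alpha + \frac{2(1-\alpha)}{n(n-1)} \cdot \frac{n(n-1)}{2} = \alpha + (1-\alpha) = 1$$

So $\alpha = 1$ gives the uniform distribution, while $\alpha = 0$ gives weights that grow linearly from $0$ for the oldest sample to $\frac{2}{n}$ for the newest.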

Demo code

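```python
import random

def distribution(alpha: float, tot_points: int, index: int) -> float:
    # Uniform part plus a part that grows linearly with the index (higher index = newer)
    return alpha / tot_points + 2 * (1 - alpha) * index / (tot_points * (tot_points - 1))
```

Then compute probabilities for each sample

```python
alpha, tot_points = 0.5, 100  # example values only
probabilities = [distribution(alpha, tot_points, i) for i in range(tot_points)]
```

And sample accordingly

```python
random.choices(range(tot_points), probabilities, k=10**6)
```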

If we don't want replacement, we can use the analogous method from NumPy, e.g. `numpy.random.choice` with `p=probabilities` and `replace=False`.
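A minimal sketch of the no-replacement variant, assuming NumPy is available. The helper names `freshness_probabilities` and `sample_without_replacement` are hypothetical and not part of Modyn; the weights are the same as in the formula above, just computed for all indices at once.

```python
import numpy as np


def freshness_probabilities(alpha: float, tot_points: int) -> np.ndarray:
    """Hypothetical helper: per-index probabilities, index 0 being the oldest sample."""
    if tot_points < 2:
        # The formula divides by n * (n - 1), so it needs at least two points.
        raise ValueError("tot_points must be at least 2")
    idx = np.arange(tot_points)
    return alpha / tot_points + 2 * (1 - alpha) * idx / (tot_points * (tot_points - 1))


def sample_without_replacement(alpha: float, tot_points: int, k: int, seed=None) -> np.ndarray:
    """Hypothetical helper: draw k distinct indices, favoring newer ones."""
    rng = np.random.default_rng(seed)
    probabilities = freshness_probabilities(alpha, tot_points)
    # replace=False guarantees each index is returned at most once.
    return rng.choice(tot_points, size=k, replace=False, p=probabilities)


if __name__ == "__main__":
    print(sample_without_replacement(alpha=0.5, tot_points=100, k=10, seed=0))
```

Two caveats: with `alpha = 0` the oldest sample has probability zero, so `k` can be at most `tot_points - 1`; and when drawing without replacement, the chance that a given sample ends up in the result is no longer exactly proportional to its weight.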

Another idea would be to sample from a Gaussian with a variance proportional to the number of samples (so that, in the limit, it converges to a uniform distribution), but I don't think we need this complexity.