FreshnessSamplingStrategy: Add strategy to prioritize newer data

We can sample from a distribution like $$P(sample[i] = 1) = \frac{\alpha}{n} + i \cdot\frac{2(1-\alpha)}{n(n-1)}$$ where $\alpha \in [0, 1]$ is the mixing constant between uniform ($\alpha = 1$) and fresh-first ($\alpha = 0$) sampling, $n$ is the number of points, and $i = 0, \dots, n-1$ is the index of a point. The factor $\frac{2}{n(n-1)}$ normalizes the linear term so that the probabilities sum to 1. A higher index means a newer sample.
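As a quick sanity check that the formula defines a distribution, assuming the index runs over $i = 0, \dots, n-1$ (as `range(tot_points)` in the demo suggests) and $n \geq 2$:

$$\sum_{i=0}^{n-1}\left(\frac{\alpha}{n} + i \cdot\frac{2(1-\alpha)}{n(n-1)}\right) = \alpha + \frac{2(1-\alpha)}{n(n-1)} \cdot \frac{n(n-1)}{2} = \alpha + (1-\alpha) = 1$$

For example, with $n = 4$ and $\alpha = 0.5$ the four probabilities are $\frac{1}{8}, \frac{5}{24}, \frac{7}{24}, \frac{3}{8}$ (about 0.125, 0.208, 0.292, 0.375), so the newest point is three times as likely to be drawn as the oldest.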

Demo code

    import random

    def distribution(alpha: float, tot_points: int, index: int) -> float:
        # Uniform share plus a share that grows linearly with the index (higher index = newer).
        return alpha / tot_points + 2 * (1 - alpha) * index / (tot_points * (tot_points - 1))

Then compute the probability of each sample:

    probabilities = [distribution(alpha, tot_points, i) for i in range(tot_points)]

And sample accordingly (with replacement):

    random.choices(range(tot_points), probabilities, k=10**6)

If we don't want replacement, we can use the analogous method from NumPy (`numpy.random.choice` with `replace=False` and `p=probabilities`).
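A minimal sketch of the no-replacement variant, assuming NumPy is available. The helper names `freshness_probabilities` and `sample_without_replacement` are hypothetical and only for illustration; they are not part of modyn.

```python
import numpy as np


def freshness_probabilities(alpha: float, tot_points: int) -> np.ndarray:
    """Vectorized form of the distribution above; index 0 is the oldest point."""
    if tot_points < 2:
        raise ValueError("tot_points must be at least 2")
    indices = np.arange(tot_points)
    return alpha / tot_points + 2 * (1 - alpha) * indices / (tot_points * (tot_points - 1))


def sample_without_replacement(alpha: float, tot_points: int, k: int, seed=None) -> np.ndarray:
    """Draw k distinct indices, weighted towards newer (higher-index) points."""
    rng = np.random.default_rng(seed)
    probabilities = freshness_probabilities(alpha, tot_points)
    # replace=False guarantees that no index is drawn twice.
    return rng.choice(tot_points, size=k, replace=False, p=probabilities)


picked = sample_without_replacement(alpha=0.5, tot_points=100, k=10, seed=0)
print(sorted(picked.tolist()))
```

Note that without replacement the per-point inclusion probabilities only approximately follow the formula, since each draw renormalizes over the points that are still available. With $\alpha = 0$ the oldest point has probability 0, so at most `tot_points - 1` points can be drawn.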

Another idea would be to sample from a Gaussian with a variance proportional to the number of samples (so that, in the limit, it converges to a uniform distribution), but I don't think we need this complexity.

eth-easl / modyn

FreshnessSamplingStrategy: Add strategy to prioritize newer data #125