NUS-HPC-AI-Lab / InfoBatch

Lossless Training Speed Up by Unbiased Dynamic Data Pruning

For Table 1 and prune ratio #16

Closed Feng-Hong closed 9 months ago

Feng-Hong commented 10 months ago

The prune ratio in your code seems to be applied only to samples whose loss is below the mean. Isn't this unfair to the other methods in Tab. 1, which prune across all samples?

If the comparison is fair, i.e. InfoBatch's prune ratio is also calculated over all samples, how is the 70% prune ratio achieved, given that you keep every sample whose loss is above the mean?
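
For concreteness, here is my understanding of the pruning step as a minimal sketch (simplified NumPy, not the actual code from this repo; the function and variable names are mine):

```python
import numpy as np

def soft_prune_indices(losses, prune_ratio, seed=0):
    """Keep every sample with loss >= mean; randomly drop `prune_ratio`
    of the samples whose loss is below the mean (my simplified reading)."""
    losses = np.asarray(losses)
    rng = np.random.default_rng(seed)
    below = np.where(losses < losses.mean())[0]
    above = np.where(losses >= losses.mean())[0]
    kept_below = below[rng.random(below.size) >= prune_ratio]
    return np.sort(np.concatenate([above, kept_below]))

# Skewed losses (like cross-entropy): with prune_ratio = 0.7, the fraction of
# *all* samples that gets dropped is 0.7 * (fraction below the mean), i.e. < 0.7.
losses = np.random.default_rng(1).exponential(size=10_000)
kept = soft_prune_indices(losses, prune_ratio=0.7)
print(1 - kept.size / losses.size)  # overall pruned fraction, well under 70%
```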

Yeez-lee commented 10 months ago

> The prune ratio in your code seems to be applied only to samples whose loss is below the mean. Isn't this unfair to the other methods in Tab. 1, which prune across all samples?
>
> If the comparison is fair, i.e. InfoBatch's prune ratio is also calculated over all samples, how is the 70% prune ratio achieved, given that you keep every sample whose loss is above the mean?

Same question. Based on the method and the code, if the prune ratio is set to 70%, the overall prune ratio can be less than 70%, since pruning is applied only to samples whose loss is below the mean.

henryqin1997 commented 10 months ago
  1. The prune ratio we report is computed over the total amount of data. An overall 30% prune ratio is actually below what the default setting already achieves, because cross-entropy loss has a skewed distribution whose mean is larger than its median, so more than half of the samples fall below the mean and are eligible for pruning (see the worked example after this list).
  2. A 70% pruning ratio can be achieved in two ways to control the computation. As noted in the caption, one way is to control the number of epochs. The other way is more involved: instead of using the mean as the single pruning threshold, one can use multiple quantiles (percentile thresholds such as 30%, 50%, and 75%), each with its own pruning ratio. This part of the code will be updated soon, and we will add more discussion of it in the updated version (in Feb); a rough sketch of the idea follows this list.
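
To illustrate point 1 with made-up numbers (placeholders only, not measurements from the paper or the repo):

```python
# Placeholder numbers for illustration only.
r = 0.5          # pruning probability applied to the below-mean-loss samples
p_below = 0.65   # fraction of samples with below-mean loss; > 0.5 when mean > median
overall_prune_ratio = r * p_below
print(overall_prune_ratio)  # 0.325 -> roughly a 30% overall prune ratio
```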
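
And a minimal sketch of the multi-quantile idea from point 2. The quantiles, per-band ratios, helper name, and NumPy implementation are illustrative choices, not the code that will be released:

```python
import numpy as np

def multi_quantile_prune_mask(losses, quantiles=(0.3, 0.5, 0.75),
                              ratios=(0.9, 0.7, 0.5), seed=0):
    """Keep mask over samples: each loss band below a quantile threshold is
    pruned with its own probability (placeholder thresholds and ratios)."""
    losses = np.asarray(losses)
    rng = np.random.default_rng(seed)
    thresholds = np.quantile(losses, quantiles)
    keep = np.ones(losses.size, dtype=bool)  # samples above the top threshold are always kept
    lower = -np.inf
    for thr, ratio in zip(thresholds, ratios):
        band = (losses >= lower) & (losses < thr)
        keep[band] = rng.random(band.sum()) >= ratio  # prune `ratio` of this band
        lower = thr
    return keep

losses = np.random.default_rng(1).exponential(size=10_000)  # skewed, like cross-entropy loss
keep = multi_quantile_prune_mask(losses)
print(1 - keep.mean())  # overall pruned fraction; tune the per-band ratios toward a 70% target
```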