The argument we are trying to make, based on our prior results, if we match the number of steps in continuous unlearning to sequential unlearning at a 1024 sample size (5120 steps): "For sequential unlearning, with the same number of unlearning steps (5120) as continuous unlearning, and ~10x less data (1024 samples for sequential vs. 5120 steps * 2 = 10240 samples for continuous), we can achieve the same/better/(hopefully not worse) unlearning results (based on eval benchmarks)."
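Quick sanity check of that data-budget arithmetic (the 2 samples per continuous step is taken from the 5120*2 figure above, so treat it as an assumption):

```python
# Back-of-the-envelope data budget for the matched-steps comparison.
steps = 5120                      # matched optimiser steps for both regimes
continuous_samples = steps * 2    # continuous sees fresh data every step -> 10240
sequential_samples = 1024         # sequential revisits a fixed 1024-sample split
print(continuous_samples / sequential_samples)  # 10.0, i.e. ~10x less data sequentially
```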
What we should be doing here, however (AND when visualising in #82), is logging vs. AMOUNT OF DATA PROCESSED (unlearned sample count, i.e. literally how many text samples went through the algorithm), because this x-axis will be better aligned across methods.
How about a logical batch axis (how many actual gradient ascent steps are taken)? This shows which method is quicker, but the x-axis is heavily unaligned: a batch is 20 steps, and sequential runs will differ across split sizes (though stay the same across the number of splits).
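One way to avoid choosing between the two axes would be to record both counters on every update and pick the x-axis at plotting time. A minimal sketch, assuming a wandb-style logger; the counter and key names are illustrative, not what we currently log:

```python
import wandb

optimizer_steps = 0     # "logical batch" axis: gradient ascent steps taken
samples_processed = 0   # data axis: text samples that went through the algorithm

def log_update(metrics: dict, batch_size: int) -> None:
    """Record both candidate x-axes on every gradient-ascent step,
    so either plot (speed comparison vs. data-aligned) can be drawn later."""
    global optimizer_steps, samples_processed
    optimizer_steps += 1              # shows which method is quicker
    samples_processed += batch_size   # aligned across continuous/sequential/batch
    wandb.log({**metrics,
               "optimizer_steps": optimizer_steps,
               "samples_processed": samples_processed})
```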
So far, we've run continuous unlearning for up to 1200 steps (equivalent to 2400 samples), while sequential and batch unlearning ran for longer, leading to more optimiser steps. A corresponding ablation study should be performed, in which continuous unlearning runs for more optimiser steps.
Now, the following are worth noting: