The argument we are trying to make, based on our prior results, if we match the number of steps in continuous unlearning to sequential unlearning at a 1024 sample size (5120 steps): "For sequential unlearning, with the same number of unlearning steps (5120) as continuous unlearning, and ~10x less data (1024 samples for sequential vs. 5120 steps * 2 = 10240 samples for continuous), we can achieve the same/better/(hopefully not worse) unlearning results (based on eval benchmarks)."
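Quick sanity check of that data-budget arithmetic (the 2 samples per continuous step is taken from the 5120*2 figure above, so treat it as an assumption):

```python
# Back-of-the-envelope data budget for the matched-steps comparison.
steps = 5120                      # matched optimiser steps for both regimes
continuous_samples = steps * 2    # continuous sees fresh data every step -> 10240
sequential_samples = 1024         # sequential revisits a fixed 1024-sample split
print(continuous_samples / sequential_samples)  # 10.0, i.e. ~10x less data sequentially
```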
What we should be doing here, however (AND when visualising in #82), is logging vs. AMOUNT OF DATA PROCESSED (unlearned sample count, i.e. literally how many text samples went through the algorithm), because this x-axis will be better aligned across methods.
How about a logical batch axis (how many actual gradient ascent steps are taken)? This shows which method is quicker, but the x-axis is heavily unaligned: a batch is 20 steps, and sequential runs will differ across split sizes (though stay the same across the number of splits).
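One way to avoid choosing between the two axes would be to record both counters on every update and pick the x-axis at plotting time. A minimal sketch, assuming a wandb-style logger; the counter and key names are illustrative, not what we currently log:

```python
import wandb

optimizer_steps = 0     # "logical batch" axis: gradient ascent steps taken
samples_processed = 0   # data axis: text samples that went through the algorithm

def log_update(metrics: dict, batch_size: int) -> None:
    """Record both candidate x-axes on every gradient-ascent step,
    so either plot (speed comparison vs. data-aligned) can be drawn later."""
    global optimizer_steps, samples_processed
    optimizer_steps += 1              # shows which method is quicker
    samples_processed += batch_size   # aligned across continuous/sequential/batch
    wandb.log({**metrics,
               "optimizer_steps": optimizer_steps,
               "samples_processed": samples_processed})
```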
So far, we've run continuous unlearning for up to 1200 steps (equivalent to 2400 samples), while sequential and batch unlearning ran for longer, leading to more optimiser steps. A corresponding ablation study should be performed, in which continuous unlearning runs for more optimiser steps.
Now, the following are worth noting: