EleutherAI / project-menu


[RFP] Iso-effective-context byte level vs BPE tokenization #29

Closed · leogao2 closed this 1 year ago

leogao2 commented 3 years ago

Background

Apparently byte-level models do worse than BPE models (todo: find where this is from / show in our own experiments) even if the embedding at each position is the same size. This is generally taken to mean that byte-level tokenization isn't efficient for learning, but there's another possibility: byte-level models could do worse simply because they have a smaller effective context (and less context makes prediction harder). A BPE token covers roughly 4 bytes of English text on average, so with the same number of positions a byte-level model sees roughly 4x less raw text.
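To make "effective context" concrete, here's a minimal sketch of the gap. It assumes the GPT-2 BPE via Hugging Face transformers and a placeholder `sample.txt`; the actual experiments might use a different tokenizer and corpus.

```python
# Sketch: effective context, measured in raw bytes of text, for a byte-level
# model vs a BPE model given the same number of positions.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = open("sample.txt", encoding="utf-8").read()  # placeholder corpus slice
n_bytes = len(text.encode("utf-8"))
n_bpe_tokens = len(tok.encode(text))

bytes_per_token = n_bytes / n_bpe_tokens  # ~4 for English under GPT-2's BPE

ctx = 2048  # number of positions in the model
print(f"byte-level model: {ctx} bytes of effective context")
print(f"BPE model:       ~{ctx * bytes_per_token:.0f} bytes of effective context")
```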

What to plot?

Here are all the experiments I'd want to run:

Eval is BPB and the eval harness, probably. BPB (bits per byte) normalizes loss by the raw byte count of the text rather than by token count, so it's directly comparable between byte-level and BPE models.
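For reference, converting mean per-token cross-entropy to BPB is just a renormalization; a minimal sketch (function name and numbers are illustrative):

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy in nats/token to bits per byte.

    Total nats = nats_per_token * n_tokens; divide by the raw byte count
    and by ln(2) to convert nats to bits. Both byte-level and BPE models
    are then scored against the same number of raw bytes.
    """
    return nats_per_token * n_tokens / (n_bytes * math.log(2))

# e.g. a BPE model averaging 2.9 nats/token over 1000 tokens covering
# 4000 bytes of text:
print(bits_per_byte(2.9, 1000, 4000))  # ~1.05 BPB
```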

We could also run these at various degrees of in-between by using different vocab sizes for BPE and see whether there's a smooth trend; a sketch of that sweep is below.
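A minimal sketch of the sweep, assuming the Hugging Face `tokenizers` library, a placeholder `corpus.txt`, and an illustrative set of vocab sizes (vocab 256 is the pure byte-level end of the spectrum):

```python
# Train BPE tokenizers at several vocab sizes and record the compression
# ratio (bytes/token) each one achieves on the same text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

text = open("corpus.txt", encoding="utf-8").read()  # placeholder corpus
n_bytes = len(text.encode("utf-8"))

for vocab_size in [256, 512, 1024, 4096, 16384, 50257]:
    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tok.train(["corpus.txt"], trainer=trainer)
    n_tokens = len(tok.encode(text).ids)
    print(f"vocab {vocab_size:6d}: {n_bytes / n_tokens:.2f} bytes/token")
```

Each point on that curve gives a bytes-per-token ratio, which is exactly the factor you'd scale the context length by to hold effective context constant.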