EleutherAI / project-menu


[RFP] Iso-effective-context byte level vs BPE tokenization #29

Closed · leogao2 closed this 1 year ago

leogao2 commented 3 years ago

Background

Apparently byte-level models do worse than BPE models (todo: find where this is from / show in our own experiments) even if the embedding at each position is the same size. This is generally taken to mean that byte-level tokenization isn't efficient for learning, but there's another possibility: byte-level models could do worse simply because they have a smaller effective context (and less context makes prediction harder). A BPE token covers roughly 4 bytes of English text on average, so with the same number of positions a byte-level model sees roughly 4x less raw text.
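To make "effective context" concrete, here's a minimal sketch of the gap. It assumes the GPT-2 BPE via Hugging Face transformers and a placeholder `sample.txt`; the actual experiments might use a different tokenizer and corpus.

```python
# Sketch: effective context, measured in raw bytes of text, for a byte-level
# model vs a BPE model given the same number of positions.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = open("sample.txt", encoding="utf-8").read()  # placeholder corpus slice
n_bytes = len(text.encode("utf-8"))
n_bpe_tokens = len(tok.encode(text))

bytes_per_token = n_bytes / n_bpe_tokens  # ~4 for English under GPT-2's BPE

ctx = 2048  # number of positions in the model
print(f"byte-level model: {ctx} bytes of effective context")
print(f"BPE model:       ~{ctx * bytes_per_token:.0f} bytes of effective context")
```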

What to plot?

Here are all the experiments I'd want to run:

Eval is BPB and the eval harness, probably. BPB (bits per byte) normalizes loss by the raw byte count of the text rather than by token count, so it's directly comparable between byte-level and BPE models.
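For reference, converting mean per-token cross-entropy to BPB is just a renormalization; a minimal sketch (function name and numbers are illustrative):

```python
import math

def bits_per_byte(nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy in nats/token to bits per byte.

    Total nats = nats_per_token * n_tokens; divide by the raw byte count
    and by ln(2) to convert nats to bits. Both byte-level and BPE models
    are then scored against the same number of raw bytes.
    """
    return nats_per_token * n_tokens / (n_bytes * math.log(2))

# e.g. a BPE model averaging 2.9 nats/token over 1000 tokens covering
# 4000 bytes of text:
print(bits_per_byte(2.9, 1000, 4000))  # ~1.05 BPB
```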

We could also run these at various degrees of in-between by using different vocab sizes for BPE and see whether there's a smooth trend; a sketch of that sweep is below.
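A minimal sketch of the sweep, assuming the Hugging Face `tokenizers` library, a placeholder `corpus.txt`, and an illustrative set of vocab sizes (vocab 256 is the pure byte-level end of the spectrum):

```python
# Train BPE tokenizers at several vocab sizes and record the compression
# ratio (bytes/token) each one achieves on the same text.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

text = open("corpus.txt", encoding="utf-8").read()  # placeholder corpus
n_bytes = len(text.encode("utf-8"))

for vocab_size in [256, 512, 1024, 4096, 16384, 50257]:
    tok = Tokenizer(models.BPE())
    tok.pre_tokenizer = pre_tokenizers.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tok.train(["corpus.txt"], trainer=trainer)
    n_tokens = len(tok.encode(text).ids)
    print(f"vocab {vocab_size:6d}: {n_bytes / n_tokens:.2f} bytes/token")
```

Each point on that curve gives a bytes-per-token ratio, which is exactly the factor you'd scale the context length by to hold effective context constant.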