bclarkson-code / Tricycle

Autograd to GPT-2 completely from scratch

Add regex and special character support to tokeniser #40

Closed bclarkson-code closed 2 months ago

bclarkson-code commented 5 months ago

The tokeniser should be extended to support special characters (e.g. pad tokens) and regex parsing (e.g. to avoid merging across words).
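As a rough illustration of the request, here is a minimal sketch of how a byte-level BPE tokeniser could handle both features. It is not Tricycle's actual code: the names `SPECIAL_TOKENS`, `WORD_PATTERN`, `split_special`, `pre_tokenise` and `count_pairs` are hypothetical, the token ids are made up, and the regex is a simplified stand-in rather than the pattern GPT-2 uses. The idea is to carve out special tokens first, split the remaining text into word-like chunks, and only count merge candidates inside a chunk.

```python
import re
from collections import Counter

# Hypothetical special tokens, given ids just above the 256 raw byte values.
SPECIAL_TOKENS = {"<|pad|>": 256, "<|endoftext|>": 257}

# Simplified pre-tokenisation pattern (an assumption, not GPT-2's exact regex):
# an optional leading space plus letters, digits or punctuation, or a run of whitespace.
WORD_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def split_special(text):
    """Split text into (piece, is_special) pairs, keeping special tokens whole."""
    pattern = "(" + "|".join(re.escape(tok) for tok in SPECIAL_TOKENS) + ")"
    # A capturing group makes re.split keep the separators in the result.
    return [(p, p in SPECIAL_TOKENS) for p in re.split(pattern, text) if p]


def pre_tokenise(text):
    """Return chunks: an int for a special token, or a list of byte ids for a word."""
    chunks = []
    for piece, is_special in split_special(text):
        if is_special:
            chunks.append(SPECIAL_TOKENS[piece])
        else:
            chunks.extend(list(w.encode("utf-8")) for w in WORD_PATTERN.findall(piece))
    return chunks


def count_pairs(chunks):
    """Count adjacent id pairs within each word chunk, never across chunks."""
    counts = Counter()
    for chunk in chunks:
        if isinstance(chunk, int):
            continue  # special tokens are atomic and never take part in merges
        counts.update(zip(chunk, chunk[1:]))
    return counts
```

With this layout, the pair statistics used to pick the next merge never include a pair that spans two words, and a pad token is always emitted as a single id regardless of which merges have been learned.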