CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Subword tokenization #162

Open bonham79 opened 4 months ago

bonham79 commented 4 months ago

What are people's thoughts on adding preprocessing scripts to allow BPE-like tokenization of characters? Technically we already support this (just tokenize your input yourself and use the delineation function). But I wonder whether it would also be worthwhile to write up the scripting so it can be managed by the repo as well?
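For example, something along these lines would cover the do-it-yourself route (just a rough sketch using SentencePiece; the file names and vocabulary size are made up):

```python
import sentencepiece as spm

# Train a small BPE model over the source strings (one string per line).
spm.SentencePieceTrainer.train(
    input="sources.txt",
    model_prefix="bpe",
    vocab_size=500,
    model_type="bpe",
)

# Encode a word into space-delimited subwords; yoyodyne can then split the
# resulting field on whitespace rather than into individual characters.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(" ".join(sp.encode("untranslatability", out_type=str)))
```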

kylebgorman commented 4 months ago

I am weakly opposed. It is a big source of complexity in FairSeq and we don't have any reason to suppose it improves things on this task. (That said, fork and try it out and if it works better than expected...)

The one context in which I could imagine something vaguely similar is if we support using pretrained encoders---which we should. (I think there's an existing issue for that.) Then you'd just delegate the tokenization to the model's tokenizer.
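For instance (just a sketch, assuming Hugging Face transformers and an arbitrary multilingual checkpoint):

```python
from transformers import AutoTokenizer

# The pretrained encoder's own tokenizer produces the subwords; we would
# only need to consume the resulting space-delimited pieces (or their ids).
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(" ".join(tokenizer.tokenize("untranslatability")))
```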

Adamits commented 4 months ago

If we want to do this, I think an example (in /examples) would be appropriate: use existing or custom code to tokenize your data with the tokenizer of your choice, write it out to new train/dev/test files, and then run yoyodyne on the tokenized data.
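Something along these lines, say (a rough sketch; the file names are placeholders, and it assumes a SentencePiece BPE model trained beforehand):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")

# Rewrite each split, tokenizing the source column into space-delimited
# subwords and passing any remaining columns (target, features) through.
for path in ["train.tsv", "dev.tsv", "test.tsv"]:
    with open(path, encoding="utf-8") as source, open(
        f"bpe_{path}", "w", encoding="utf-8"
    ) as sink:
        for line in source:
            source_str, *rest = line.rstrip("\n").split("\t")
            pieces = sp.encode(source_str, out_type=str)
            print(" ".join(pieces), *rest, sep="\t", file=sink)
```

You'd then train on the bpe_* files as usual, telling yoyodyne to split the source on whitespace, and keep the BPE model around to detokenize predictions.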

kylebgorman commented 4 months ago

examples is the wild west, do what you will there, within reason ;)

bonham79 commented 4 months ago

Those were my exact thoughts: use it if wanted, drop it if not needed. It would probably do decently on inflection tasks for languages with deep orthographies.