Hierarchical universes and a tokenizer config

NLP tokenizer vocabularies (e.g. Hugging Face tokenizers) are often distributed as .json configuration files. The reason is that modern language tokenizers are configurable beyond their vocabularies (e.g. pre-processors, special tokens, post-processors, etc.).

Should we distribute gtokenizers the same way? Instead of a single BED file, it would be a .yaml file that points to a BED file, along with other things like a list of exclude_ranges, secondary universes (hierarchical tokenization), etc.
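To make the idea concrete, here is a rough sketch of what such a config could hold, modeled as a Python dataclass. Everything in it is an assumption for illustration: the field names (`universe`, `exclude_ranges`, `secondary_universes`), the file names, and the `TokenizerConfig` class are hypothetical and not an existing gtars schema. The dict stands in for whatever a YAML parser would return.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical YAML layout this sketch assumes:
#   universe: primary.bed
#   exclude_ranges: blacklist.bed
#   secondary_universes:
#     - coarse_tiles.bed


@dataclass
class TokenizerConfig:
    """Illustrative config object; not part of gtars."""

    universe: str                          # path to the primary BED file
    exclude_ranges: Optional[str] = None   # optional BED file of regions to skip
    secondary_universes: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "TokenizerConfig":
        # The primary universe is the only required key in this sketch.
        if "universe" not in raw:
            raise ValueError("config must name a primary universe BED file")
        return cls(
            universe=raw["universe"],
            exclude_ranges=raw.get("exclude_ranges"),
            secondary_universes=list(raw.get("secondary_universes", [])),
        )


# Stand-in for the parsed contents of the .yaml file.
raw_config = {
    "universe": "primary.bed",
    "exclude_ranges": "blacklist.bed",
    "secondary_universes": ["coarse_tiles.bed"],
}
config = TokenizerConfig.from_dict(raw_config)
```

Keeping only `universe` required would let a config with nothing but a BED path behave like today's single-file distribution.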

This could be a way to implement hierarchical universes, in addition to enhancing the fragment tokenizers.
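One possible reading of "hierarchical" is a fallback chain: tokenize against the primary universe first, and only consult secondary universes when nothing overlaps. The sketch below illustrates that reading only; the function name `tokenize_hierarchical`, the tuple-based `Region` type, and the fallback semantics are assumptions, not gtars behavior. It uses a linear scan for clarity where a real tokenizer would use an interval index.

```python
from typing import List, Optional, Tuple

# (chrom, start, end), half-open coordinates as in BED
Region = Tuple[str, int, int]


def overlaps(a: Region, b: Region) -> bool:
    """True if two half-open intervals on the same chromosome intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenize_hierarchical(
    query: Region, universes: List[List[Region]]
) -> Optional[Tuple[int, List[Region]]]:
    """Return (level, hits) from the first universe with any overlap.

    `universes` is ordered from primary to least specific. Returns None
    when no universe covers the query.
    """
    for level, universe in enumerate(universes):
        hits = [region for region in universe if overlaps(query, region)]
        if hits:
            return level, hits
    return None


primary = [("chr1", 100, 200)]
secondary = [("chr1", 0, 1000), ("chr1", 1000, 2000)]

in_primary = tokenize_hierarchical(("chr1", 150, 160), [primary, secondary])
fell_back = tokenize_hierarchical(("chr1", 500, 600), [primary, secondary])
unmatched = tokenize_hierarchical(("chr2", 1, 2), [primary, secondary])
```

Returning the level alongside the hits lets downstream code tell a primary-universe token apart from a fallback token.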

databio / gtars

Hierarchical universes and a tokenizer config #25