databio / gtars

Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.

Hierarchical universes and a tokenizer config #25

Open nleroy917 opened 4 months ago

nleroy917 commented 4 months ago

NLP/Hugging Face tokenizer vocabularies are often distributed as .json configuration files. The reason is that modern language tokenizers are configurable beyond their vocabularies (e.g. pre-processors, special tokens, post-processors, etc.).

Should we distribute gtokenizers the same way? Instead of a single BED file, it would be a .yaml file that points to a BED file, in addition to other things like maybe a list of exclude_ranges, secondary universes (hierarchical tokenization), etc.

This could be a way to implement hierarchical universes, in addition to enhancing the fragment tokenizers.
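
To make the idea concrete, here is a rough Python sketch of how such a config could drive hierarchical tokenization. Everything in it is an assumption for illustration: the key names in the docstring (`universe`, `secondary_universes`, `exclude_ranges`), the `TokenizerConfig` class, and the `tokenize` function are hypothetical and not part of gtars or geniml. The BED contents are inlined as lists so the example runs without any files, and the overlap check is a plain linear scan rather than anything performance-oriented.

```python
from dataclasses import dataclass, field

# A region is (chrom, start, end), half-open like a BED record.
Region = tuple[str, int, int]


@dataclass
class TokenizerConfig:
    """Hypothetical in-memory form of the proposed .yaml file.

    A file with this layout might look like (key names are assumptions):

        universe: primary.bed
        secondary_universes:
          - fallback.bed
        exclude_ranges: excluded.bed
    """

    # Ordered universes: primary first, then secondary ones.
    universes: list[list[Region]]
    exclude_ranges: list[Region] = field(default_factory=list)


def overlaps(a: Region, b: Region) -> bool:
    """True if two half-open regions on the same chromosome intersect."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def tokenize(query: Region, config: TokenizerConfig) -> list[tuple[int, Region]]:
    """Return (level, region) pairs from the first universe that overlaps.

    Queries touching an excluded range, or matching no universe, yield [].
    """
    if any(overlaps(query, ex) for ex in config.exclude_ranges):
        return []
    for level, universe in enumerate(config.universes):
        hits = [region for region in universe if overlaps(query, region)]
        if hits:
            return [(level, region) for region in hits]
    return []


# Toy data standing in for the contents of the BED files.
config = TokenizerConfig(
    universes=[
        [("chr1", 100, 200)],                      # primary universe
        [("chr1", 0, 1000), ("chr2", 0, 500)],     # secondary universe
    ],
    exclude_ranges=[("chr1", 900, 1000)],
)
```

With this toy config, a query inside the primary region resolves at level 0, a query that only the secondary universe covers falls back to level 1, and anything touching the excluded range produces no tokens. Whether the fallback should stop at the first matching level or collect hits from every level is a design choice this sketch does not settle.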