Performance-critical tools to manipulate, analyze, and process genomic interval data. Primarily focused on building tools for geniml, our genomic machine learning Python package.
NLP/huggingface tokenizer vocabularies are often distributed as `.json` configuration files. The reason for this is that modern language tokenizers are configurable beyond their respective vocabularies (e.g. pre-processors, special tokens, post-processors, etc.).
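For reference, an abbreviated sketch of what a huggingface `tokenizer.json` carries beyond the vocabulary itself (field values here are illustrative, not a complete real file):

```json
{
  "version": "1.0",
  "added_tokens": [{"id": 0, "content": "<unk>", "special": true}],
  "normalizer": {"type": "Lowercase"},
  "pre_tokenizer": {"type": "Whitespace"},
  "post_processor": {"type": "TemplateProcessing"},
  "model": {
    "type": "BPE",
    "vocab": {"<unk>": 0},
    "merges": []
  }
}
```

The vocabulary is just one nested field (`model.vocab`); the rest of the file configures the processing pipeline around it.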
Should we distribute `gtokenizers` the same way? Instead of a single BED file, a tokenizer would be a `.yaml` file that points to a BED file, in addition to other things like maybe a list of `exclude_ranges`, secondary universes (hierarchical tokenization), etc.
This could be a way to implement hierarchical universes, in addition to enhancing the fragment tokenizers.
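A minimal sketch of what such a config could look like. All field names here (`universe`, `exclude_ranges`, `hierarchical_universes`, `special_tokens`) are hypothetical, not an existing gtokenizers schema:

```yaml
# Hypothetical gtokenizer config; all keys are illustrative only
universe: universes/tiles_1000bp.bed        # primary universe (BED file)
exclude_ranges: excludes/blacklist.bed      # regions to drop before tokenizing
hierarchical_universes:                     # optional secondary universes,
  - universes/tiles_5000bp.bed              # coarser levels for hierarchical
  - universes/chromosomes.bed               # tokenization
special_tokens:
  unknown: "<unk>"
  pad: "<pad>"
```

Distributing this file (plus the BED files it points to) instead of a bare BED file would let a single artifact describe the full tokenization pipeline, mirroring how huggingface ships `tokenizer.json`.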