The TikToken tokenizers require a data file to load. Currently we download this data from the internet, but we also provide an API that lets callers supply it as a stream. We could also provide an API that needs no stream and instead loads the data from an embedded resource. We might also consider other formats that are smaller or faster to load.
The files are small enough that embedding them in a binary, or in the package itself, would not be unreasonable.
Including all of them in the base library would bloat it somewhat, and may not be linker friendly. Potential alternatives that might be more linker friendly: granular assemblies for each resource, build targets to opt into embedding a resource, a source generator that detects usage and injects the resource, etc.
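As a rough illustration of the "build targets to opt into embedding a resource" option, a package could ship a `.targets` file along these lines. This is only a sketch under assumptions: the property name `EmbedTokenizerData`, the `data` folder layout, and the file name `encoding.tiktoken` are hypothetical placeholders, not anything that exists today.

```xml
<!-- Hypothetical build/<PackageId>.targets shipped inside the NuGet package. -->
<!-- All property, folder, and file names here are illustrative assumptions. -->
<Project>
  <!-- Only embed the data file when the consuming project opts in. -->
  <ItemGroup Condition="'$(EmbedTokenizerData)' == 'true'">
    <EmbeddedResource Include="$(MSBuildThisFileDirectory)..\data\encoding.tiktoken"
                      LogicalName="encoding.tiktoken" />
  </ItemGroup>
</Project>
```

A consuming project would then set `<EmbedTokenizerData>true</EmbedTokenizerData>` in a `PropertyGroup`, and the data would be compiled into that project's own assembly rather than into the base library. Projects that do not opt in would pay no size cost, which is the linker-friendliness argument for this option.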
We could consider similar approaches for other tokenizer configuration.