Investigate alternate API / Packaging for Tokenizer data streams

ericstj commented 3 months ago

The TikToken tokenizers require data to load. Currently we download this from the internet, but also provide an API for callers to specify the stream. We could also provide an API that does not require the stream but loads from an embedded resource. We might also consider other formats which may be smaller in size or faster to load.
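
For illustration only, here is a minimal sketch of what loading from a caller-supplied stream involves, written in Python rather than the library's C#. It assumes the commonly distributed `.tiktoken` layout of one `<base64 token> <rank>` pair per line. The name `load_tiktoken_vocab` is hypothetical and not an API from this repo. Because the loader only needs a readable byte stream, the same function could be fed a file, an embedded resource, or a compressed wrapper.

```python
import base64
import gzip
import io
from typing import BinaryIO, Dict


def load_tiktoken_vocab(stream: BinaryIO) -> Dict[bytes, int]:
    """Parse a .tiktoken-style stream into a token-bytes -> rank mapping.

    Assumed format: each non-empty line is "<base64 token> <rank>".
    """
    vocab: Dict[bytes, int] = {}
    for raw_line in stream:
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines
        token_b64, rank = line.split()
        vocab[base64.b64decode(token_b64)] = int(rank)
    return vocab


# Tiny hand-made sample standing in for a real data file.
sample_bytes = b"aGVsbG8= 0\nIHdvcmxk 1\n"

# Plain stream, as a caller might pass today.
vocab = load_tiktoken_vocab(io.BytesIO(sample_bytes))

# Same data behind a gzip wrapper, as one possible smaller packaging.
compressed = gzip.compress(sample_bytes)
vocab_from_gzip = load_tiktoken_vocab(
    gzip.GzipFile(fileobj=io.BytesIO(compressed))
)
```

Whether a compressed or binary layout actually loads faster would need to be measured against the real files; the sketch only shows that the stream abstraction leaves that choice open.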

The files are small enough that embedding them in a binary, or shipping them in the package itself, isn't unreasonable (sizes in bytes):

1,681,126 cl100k_base.tiktoken
  835,554 gpt2.tiktoken
  836,186 p50k_base.tiktoken
  835,554 r50k_base.tiktoken

Including all of them in the base library would add 4,188,420 bytes (about 4 MB) to it, and may not be linker-friendly. Potential alternatives that might be more linker-friendly: granular assemblies for each resource, build targets to opt into embedding a resource, a source generator that detects usage and injects the resource, etc.

We could consider similar approaches for other tokenizer configuration.

ericstj commented 3 months ago

https://github.com/dotnet/machinelearning/pull/7098

tarekgh commented 3 months ago

Closing it per https://github.com/dotnet/machinelearning/pull/7098.

Investigate alternate API / Packaging for Tokenizer data streams #7059