dotnet / machinelearning

Missing Sync equivalent to Microsoft.ML.Tokenizers Tiktoken.CreateByModelName without vocab stream #7077

Closed tonybaloney closed 3 months ago

tonybaloney commented 3 months ago

The only way to instantiate the Tiktoken tokenizer without a vocab stream is to call the async method CreateByModelNameAsync (https://github.com/dotnet/machinelearning/blob/main/src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs#L778C9-L782C95):

Task<Tokenizer> CreateByModelNameAsync(
    string modelName,
    IReadOnlyDictionary<string, int>? extraSpecialTokens = null,
    Normalizer? normalizer = null,
    CancellationToken cancellationToken = default)

Please add an overload of the synchronous CreateByModelName() method that takes only a model name, so that a Tiktoken tokenizer can be instantiated without an asynchronous call.
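
Until such an overload exists, one possible stopgap is to block on the async method from synchronous code. The following is only a minimal sketch: it assumes CreateByModelNameAsync is exposed as a public static method on Tiktoken with the signature quoted above, and the wrapper class TiktokenSyncWorkaround, the method CreateByModelNameBlocking, and the model name in the usage note are hypothetical names chosen for illustration.

```csharp
using System.Threading.Tasks;
using Microsoft.ML.Tokenizers;

// Hypothetical helper, not part of Microsoft.ML.Tokenizers.
public static class TiktokenSyncWorkaround
{
    // Blocks the calling thread until the async factory method completes.
    // Assumes Tiktoken.CreateByModelNameAsync is public and static, with the
    // optional parameters shown in the signature quoted in this issue.
    public static Tokenizer CreateByModelNameBlocking(string modelName)
    {
        // Task.Run starts the call on a thread-pool thread, so its continuations
        // do not need the caller's SynchronizationContext; this lowers the risk
        // of a deadlock when blocking from a UI or classic ASP.NET thread.
        return Task.Run(() => Tiktoken.CreateByModelNameAsync(modelName))
                   .GetAwaiter()
                   .GetResult(); // rethrows the original exception, not an AggregateException
    }
}
```

A call would then look like Tokenizer tokenizer = TiktokenSyncWorkaround.CreateByModelNameBlocking("gpt-4"); where the model name is just an example. This ties up the calling thread until the task finishes, so it is a stopgap rather than a replacement for a real synchronous overload.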

tonybaloney commented 3 months ago

I'm using version 0.22.0-preview.24162.2