[Tokenizers] Question regarding performance

dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.

MIT License

9.05k stars 1.88k forks source link

Hi, thanks for the effort put into the Microsoft.ML.Tokenizers!

I'm the author of the last performance improvements in SharpToken library. Since MLTokenizers are faster now than SharpToken I looked into the sources to understand where this performance comes from.

Now I have a question (out of curiosity)

Why is it required to copy a ReadOnlySpan<char> to a buffer, when the rest of the code just uses ReadOnlySpan<char> again?

TiktokenPreTokenizer.cs line: 104 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/TiktokenPreTokenizer.cs#L95-L107

PreTokenizer.cs line: 74 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs#L43-L54

public static IEnumerable<(int Offset, int Length)> PreTokenize(ReadOnlySpan<char> text) { if (text.IsEmpty) { return []; } return SplitText(text); static IEnumerable<(int Offset, int Length)> SplitText(ReadOnlySpan<char> text) { for (int i = 0; i < text.Length; i++) { yield return (i, 1); } } }

dotnet / machinelearning

[Tokenizers] Question regarding performance #7143