dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.05k stars 1.88k forks source link

[Tokenizers] Question regarding performance #7143

Closed r-Larch closed 1 month ago

r-Larch commented 7 months ago

Hi, thanks for the effort put into the Microsoft.ML.Tokenizers!

I'm the author of the last performance improvements in SharpToken library. Since MLTokenizers are faster now than SharpToken I looked into the sources to understand where this performance comes from.

Now I have a question (out of curiosity)

Why is it required to copy a ReadOnlySpan<char> to a buffer, when the rest of the code just uses ReadOnlySpan<char> again?

TiktokenPreTokenizer.cs line: 104 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/TiktokenPreTokenizer.cs#L95-L107

PreTokenizer.cs line: 74 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs#L43-L54

tarekgh commented 1 month ago

This is because instance of type System.ReadOnlySpan<char> cannot be preserved across 'yield' boundary in SplitText method. Try the following code example:

        public static IEnumerable<(int Offset, int Length)> PreTokenize(ReadOnlySpan<char> text)
        {
            if (text.IsEmpty)
            {
                return [];
            }

            return SplitText(text);

            static IEnumerable<(int Offset, int Length)> SplitText(ReadOnlySpan<char> text)
            {
                for (int i = 0; i < text.Length; i++)
                {
                    yield return (i, 1);
                }
            }
        }

you will get the compilation error: CS4007: Instance of type 'System.ReadOnlySpan<char>' cannot be preserved across 'await' or 'yield' boundary..

I am closing the issue but feel free to respond back with any more questions.