Closed r-Larch closed 1 month ago
This is because instance of type System.ReadOnlySpan<char>
cannot be preserved across 'yield' boundary in SplitText method. Try the following code example:
public static IEnumerable<(int Offset, int Length)> PreTokenize(ReadOnlySpan<char> text)
{
if (text.IsEmpty)
{
return [];
}
return SplitText(text);
static IEnumerable<(int Offset, int Length)> SplitText(ReadOnlySpan<char> text)
{
for (int i = 0; i < text.Length; i++)
{
yield return (i, 1);
}
}
}
you will get the compilation error: CS4007: Instance of type 'System.ReadOnlySpan<char>' cannot be preserved across 'await' or 'yield' boundary.
.
I am closing the issue but feel free to respond back with any more questions.
Hi, thanks for the effort put into the Microsoft.ML.Tokenizers!
I'm the author of the last performance improvements in
SharpToken
library. Since MLTokenizers are faster now than SharpToken I looked into the sources to understand where this performance comes from.Now I have a question (out of curiosity)
Why is it required to copy a
ReadOnlySpan<char>
to a buffer, when the rest of the code just usesReadOnlySpan<char>
again?TiktokenPreTokenizer.cs line: 104 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/TiktokenPreTokenizer.cs#L95-L107
PreTokenizer.cs line: 74 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs#L43-L54