Hi, thanks for the effort put into Microsoft.ML.Tokenizers!
I'm the author of the latest performance improvements in the SharpToken library.
Since Microsoft.ML.Tokenizers is now faster than SharpToken, I looked into the sources to understand where this performance comes from.
Now I have a question (out of curiosity):
Why is it required to copy a ReadOnlySpan<char> to a buffer, when the rest of the code just uses a ReadOnlySpan<char> again?
TiktokenPreTokenizer.cs line 104: https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/TiktokenPreTokenizer.cs#L95-L107
PreTokenizer.cs line 74: https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs#L43-L54
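For context, here's my guess at the constraint involved (a minimal sketch, not the actual library code): C# iterator methods (yield return) cannot take a ref struct parameter such as ReadOnlySpan<char>, because the span cannot be stored in the compiler-generated state machine. So a lazily-enumerated API that accepts a span seems forced to copy it into heap memory (a string or a char[] buffer) first. The names SpanSplitSketch and SplitCore below are hypothetical:

```csharp
using System;
using System.Collections.Generic;

public static class SpanSplitSketch
{
    // An iterator method cannot declare a ReadOnlySpan<char> parameter:
    // ref structs cannot live inside the compiler-generated state machine.
    // So a span-based entry point has to materialize the span (e.g. via
    // ToString() or a copy into a rented char[] buffer) before handing it
    // to the lazy iterator.
    public static IEnumerable<string> Split(ReadOnlySpan<char> text)
    {
        // Copy once, then enumerate over the heap copy.
        return SplitCore(text.ToString());
    }

    private static IEnumerable<string> SplitCore(string text)
    {
        foreach (string part in text.Split(' '))
        {
            yield return part; // deferred execution is legal on a string
        }
    }
}
```

If that's the reason, the copy is the price of returning IEnumerable lazily rather than processing the span eagerly in place.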