dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

[Tokenizers] Question regarding performance #7143

Open r-Larch opened 2 months ago

r-Larch commented 2 months ago

Hi, thanks for the effort put into the Microsoft.ML.Tokenizers!

I'm the author of the latest performance improvements in the SharpToken library. Since Microsoft.ML.Tokenizers is now faster than SharpToken, I looked into the source to understand where that performance comes from.

Now I have a question (out of curiosity):

Why is it necessary to copy a ReadOnlySpan<char> to a buffer when the rest of the code just uses a ReadOnlySpan<char> again?

TiktokenPreTokenizer.cs line: 104 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/TiktokenPreTokenizer.cs#L95-L107

PreTokenizer.cs line: 74 https://github.com/dotnet/machinelearning/blob/72cfdf611a510ba0570170a708ddcc1a1928f329/src/Microsoft.ML.Tokenizers/PreTokenizer/PreTokenizer.cs#L43-L54
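
For anyone reading along without opening the links, the pattern I'm asking about looks roughly like the sketch below. This is a simplified illustration and not the actual library code: the class `SpanCopySketch`, the methods `SplitOnSpaces` and `SplitCore`, and the whitespace-splitting logic are all hypothetical stand-ins. Only the shape matters here, which is that a span comes in, gets copied into a pooled array, and the array is then handed to a lazy iterator.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

// Hypothetical stand-in for a pre-tokenizer; NOT the Microsoft.ML.Tokenizers code.
public static class SpanCopySketch
{
    // Entry point that receives the text as a span (hypothetical name).
    public static IEnumerable<(int Offset, int Length)> SplitOnSpaces(ReadOnlySpan<char> text)
    {
        if (text.IsEmpty)
        {
            return Array.Empty<(int Offset, int Length)>();
        }

        // The copy in question: the span's characters are moved into a
        // pooled array before being passed to the lazy iterator below.
        char[] buffer = ArrayPool<char>.Shared.Rent(text.Length);
        text.CopyTo(buffer);
        return SplitCore(buffer, text.Length);
    }

    // Lazy iterator that works on the array copy and yields (offset, length) pairs.
    private static IEnumerable<(int Offset, int Length)> SplitCore(char[] buffer, int length)
    {
        try
        {
            int start = 0;
            for (int i = 0; i <= length; i++)
            {
                // A space or the end of the text closes the current piece.
                if (i == length || buffer[i] == ' ')
                {
                    if (i > start)
                    {
                        yield return (start, i - start);
                    }
                    start = i + 1;
                }
            }
        }
        finally
        {
            // Give the rented array back once enumeration finishes or is disposed.
            ArrayPool<char>.Shared.Return(buffer);
        }
    }
}
```

Enumerating `SpanCopySketch.SplitOnSpaces("ab  cde".AsSpan())` walks the copied characters rather than the original span, which is the step my question is about.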