dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
8.92k stars 1.86k forks source link

Tokenizer.Decode special-cases EnglishRoberta #7010

Closed stephentoub closed 4 months ago

stephentoub commented 4 months ago

This suggests there's something wrong with the Model abstraction, and it means that any other model of a similar ilk to EnglishRoberta could not be supported (or not supported efficiently, or whatever reason caused this to be special-cased here). The special-casing should be removed and the abstraction fixed to make the special-casing unnecessary. https://github.com/dotnet/machinelearning/blob/4635a862ddd21b3e7de0404f73a897fecb2011a1/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L203-L216

stephentoub commented 4 months ago

cc: @tarekgh

tarekgh commented 4 months ago

@stephentoub this was in the abstraction and I removed it from there because the functionality was specific to this tokenizer only. This is why I made it this way. I have no idea ever any other tokenizers will need that. If we see cases like this in the future we can decide to add it to the abstraction. But for now I suggest not adding it.

tarekgh commented 4 months ago

The other idea is to have Model.IdToToken always return the filtered token in EnglishRoberta tokenizer.

CC @michaelgsharp @LittleLittleCloud

stephentoub commented 4 months ago

@stephentoub this was in the abstraction and I removed it from there because the functionality was specific to this tokenizer only. This is why I made it this way. I have no idea ever any other tokenizers will need that. If we see cases like this in the future we can decide to add it to the abstraction. But for now I suggest not adding it.

There are only three tokenizers in the library. If 1/3rd of them needs whatever this is, seems likely another will as well. And if someone wanted to provide their own EnglishRoberta equivalent (maybe faster, or with subtly different semantics), how would they?

stephentoub commented 4 months ago

I just noticed that Tiktoken is also special-cased by Decode. That's two models that Decode is special-casing. Certainly seems like there's something wrong with the abstraction.

tarekgh commented 4 months ago

Ok, will think about these two cases.

stephentoub commented 4 months ago

Thanks :)