Closed stephentoub closed 4 months ago
cc: @tarekgh
@stephentoub this was in the abstraction and I removed it from there because the functionality was specific to this tokenizer only. This is why I made it this way. I have no idea ever any other tokenizers will need that. If we see cases like this in the future we can decide to add it to the abstraction. But for now I suggest not adding it.
The other idea is to have Model.IdToToken
always return the filtered token in EnglishRoberta
tokenizer.
CC @michaelgsharp @LittleLittleCloud
@stephentoub this was in the abstraction and I removed it from there because the functionality was specific to this tokenizer only. This is why I made it this way. I have no idea ever any other tokenizers will need that. If we see cases like this in the future we can decide to add it to the abstraction. But for now I suggest not adding it.
There are only three tokenizers in the library. If 1/3rd of them needs whatever this is, seems likely another will as well. And if someone wanted to provide their own EnglishRoberta equivalent (maybe faster, or with subtly different semantics), how would they?
I just noticed that Tiktoken is also special-cased by Decode. That's two models that Decode is special-casing. Certainly seems like there's something wrong with the abstraction.
Ok, will think about these two cases.
Thanks :)
This suggests there's something wrong with the Model abstraction, and it means that any other model of a similar ilk to EnglishRoberta could not be supported (or not supported efficiently, or whatever reason caused this to be special-cased here). The special-casing should be removed and the abstraction fixed to make the special-casing unnecessary. https://github.com/dotnet/machinelearning/blob/4635a862ddd21b3e7de0404f73a897fecb2011a1/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L203-L216