dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml

More API feedback on Tokenizer and friends #7013

Closed: stephentoub closed this issue 4 months ago

stephentoub commented 4 months ago

(In the original comment, the detailed feedback was collected under collapsed sections, one per type: Tokenizer, Token, PreTokenizer, NormalizedString, Trainer, Model, and TikToken. The individual points are quoted and answered in the replies below.)

I suggest we take the surface area of Microsoft.ML.Tokenizers through an API review to get more eyes on it. cc: @terrajobst, @tarekgh

stephentoub commented 4 months ago

Also, why are there two TokenToIds, one abstract without a skipSpecialTokens parameter and one virtual with it? https://github.com/dotnet/machinelearning/blob/f97642412a0dfb1ebba0c4f159bb22422dd70888/src/Microsoft.ML.Tokenizers/Model/Model.cs#L70-L78 Why not just have the one abstract method with the extra parameter? That's what appears to be done for the other methods.

tarekgh commented 4 months ago

Why not just have the one abstract method with the extra parameter? That's what appears to be done for the other methods.

Yes, I recall I mentioned this somewhere before. We should keep only the one with the extra parameter.
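To make the shape concrete, here is a minimal sketch of the consolidation being agreed on here. The class and member names mirror the issue discussion, not necessarily the final shipped API:

```csharp
public abstract class Model
{
    // Before (per the linked code): two members that every derived type must reconcile.
    //
    //   public abstract int? TokenToId(string token);
    //   public virtual int? TokenToId(string token, bool skipSpecialTokens) => TokenToId(token);
    //
    // After: a single abstract method carrying the extra parameter, matching
    // what the other methods on the type already do.
    public abstract int? TokenToId(string token, bool skipSpecialTokens = false);
}
```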

tarekgh commented 4 months ago

Tokenizer:

Tokenizer is mutable after creation? e.g. you can set its PreTokenizer, Normalizer, and Decoder after it's created. If the intent is for a Tokenizer to be created once for a process and shared, is that mutability warranted / wise? Should the constructor accept a Decoder (it doesn't today) and then the properties made get-only? Or at a minimum should the properties be made init instead of set?

Making the Tokenizer immutable makes sense. I haven't seen any scenario so far that required changing the normalization or the pre-tokenization. The only drawback is that users will need to create more than one tokenizer if they need to change any of its properties, and so far I haven't seen that scenario either. I'll track changing this, and I'll look at adding the Decoder too.
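For illustration, a hedged sketch of what an immutable surface could look like, assuming the library's existing Model, PreTokenizer, Normalizer, and TokenizerDecoder abstractions (this compiles against those types; the exact shape is an assumption, not the shipped API):

```csharp
public sealed class Tokenizer
{
    // Everything is supplied at construction, including the Decoder,
    // which today can only be set after creation.
    public Tokenizer(Model model,
                     PreTokenizer? preTokenizer = null,
                     Normalizer? normalizer = null,
                     TokenizerDecoder? decoder = null)
    {
        Model = model;
        PreTokenizer = preTokenizer;
        Normalizer = normalizer;
        Decoder = decoder;
    }

    public Model Model { get; }
    public PreTokenizer? PreTokenizer { get; }  // get-only (or `init`) rather than `set`
    public Normalizer? Normalizer { get; }
    public TokenizerDecoder? Decoder { get; }
}
```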

There are a bunch of parameters named "sequence", with XML comments then describing them as "text". Would "text" be a more obvious name for the parameter?

We inherited this name from Hugging Face as I recall, but I don't see any reason we should stick with it. I'll track renaming them to text as suggested.

I find the double-negative of "bool skipSpecialTokens = false" confusing: that being false means special tokens should be considered. Would it be better as "bool considerSpecialTokens = true" or something similar? I also don't know what it means to "skip special tokens"... they're not actually being skipped, right? They might still be returned as tokens, they're just not treated any differently from all other tokens. If we were actually "skipping special tokens", I'd expect that the implementation would always be looking for them and just not yielding / counting / etc. them when discovered, and I don't believe that's what's happening.

In Tiktoken they are not skipped, they just don't get special handling. In Bpe, especially during decoding, they really are skipped from the decoding process. We could just call it "handleSpecialToken = true", and that should work for both cases. What do you think?
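A sketch of the polarity change being discussed, whichever final name wins (the sketch uses the considerSpecialTokens suggestion from above; the signature is illustrative only):

```csharp
using System;
using System.Collections.Generic;

public static class SpecialTokenFlagSketch
{
    // Before (double negative):
    //   IReadOnlyList<int> EncodeToIds(string text, bool skipSpecialTokens = false);
    // Passing `false` means special tokens ARE considered, which is easy to misread.

    // After: a positive flag with the same default behavior. A call such as
    //   EncodeToIds(text, considerSpecialTokens: false)
    // now plainly reads as "treat special tokens like any other token".
    public static IReadOnlyList<int> EncodeToIds(string text, bool considerSpecialTokens = true)
        => Array.Empty<int>(); // placeholder body; real encoding omitted
}
```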

Tokenizer supports both Encode and Decode, but the result from Encode is a TokenizerResult type. Shouldn't that type be named something specific to encoding, since it's not relevant to other operations Tokenizer exposes?

EncodingResult would make more sense. I recall that when we introduced this name there wasn't a strong opinion about it. I'm fine with changing it to a better one.

What is the purpose of TrainFromFiles? It seems to produce an IReadOnlyList that's then entirely ignored. Should this method be deleted entirely?

TrainFromFiles can be used to generate tokenizer data (like the vocab files) from data files, so users can produce their own vocab or other data. This is also supported by Hugging Face. So far, I haven't seen anyone using it (at least internally). If we remove this functionality, we'll have to recommend that users use some other library to generate that data. As for the unused IReadOnlyList<AddedToken>? addedTokens, that was a Bpe scenario we decided not to support for now but were considering supporting later if needed; it's about allowing more special tokens to be added to the data. The following comment in the code explains that:

            // To Do: support added vocabulary in the tokenizer which will include this returned special_tokens.
            // self.add_special_tokens(&special_tokens);

Anyway, I'm OK with removing the training piece from the library until we find a real customer, and then we can bring it back.

Token:

Why is this mutable?

I think this was an oversight.

Typically, .NET APIs that deal with slices of things specify an offset and a length, usually as separate properties. These APIs instead specify starting and ending indices, as a tuple, named Index and End. That feels very inconsistent. I'd have expected an Offset/Index property and a Length property. Same feedback for Split and any other similar types.

I'm OK with changing the end index to a length to be consistent. Again, we did it that way originally to stay close to Hugging Face.
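A sketch of the two shapes under discussion (the struct and property names here are illustrative, not the shipped API):

```csharp
public readonly struct TokenSpan
{
    // Current shape per the thread: a start/end tuple, e.g. for the token
    // "world" in "hello world" => (Index: 6, End: 11).
    //
    //   public (int Index, int End) Offset { get; }

    // The more .NET-idiomatic shape: separate offset and length, matching
    // ArraySegment<T>, Span<T>.Slice(start, length), string.Substring, etc.
    // For "world" above: Offset = 6, Length = 5 (End == Offset + Length).
    public TokenSpan(int offset, int length) { Offset = offset; Length = length; }

    public int Offset { get; } // start index within the text
    public int Length { get; } // number of chars in the slice
}
```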

PreTokenizer:

The parameter to PreTokenizer is called "sentence". Why is it called "sentence" here, "sequence" in other places, "text" or "input" in various comments, etc. Must this actually be a "sentence"?

As discussed, I'm going to change the name to "text" across all the APIs: tokenizer, pre-tokenizer, etc.

NormalizedString:

It's not clear to me what the int[] mapping is. I see the description in the comment, but it's not clear enough. I'm guessing that mapping[i] for i being a position in original is the corresponding index into normalizedString?

This maps from an index in the normalized string to an index in the original string; its name is "NormalizedToOriginalMapping". We can choose a different name if needed. Currently we always return offsets into the original string. We could decide to get rid of the mapping and always return offsets into the normalized string instead. That would be another difference from Hugging Face, but it's reasonable for reducing the complexity. I'll change that.
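A small worked example (with assumed variable names) of the mapping just described, using a normalization that changes the string's length:

```csharp
using System;

// Normalizing the 'fi' ligature expands one original char into two:
//   original:   "ﬁle"  (3 chars: 'ﬁ' 'l' 'e')
//   normalized: "file" (4 chars)
string original   = "\uFB01le"; // "ﬁle"
string normalized = "file";

// mapping[i] is, for index i in the normalized string, the index of the
// character in the original string it came from.
int[] normalizedToOriginalMapping = { 0, 0, 1, 2 }; // 'f' and 'i' both come from 'ﬁ'

int normalizedIndex = 1; // the 'i' in "file"
int originalIndex = normalizedToOriginalMapping[normalizedIndex]; // 0, the 'ﬁ'
Console.WriteLine(
    $"normalized[{normalizedIndex}]='{normalized[normalizedIndex]}' " +
    $"came from original[{originalIndex}]='{original[originalIndex]}'");
```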

The constructor parameters' names don't match the property names, e.g. ctor parameter is normalizedString but the property is Normalized, ctor parameter is mapping but property is NormalizedToOriginalMapping, etc.

I'll track fixing that.

Trainer:

Does anyone use this Trainer stuff? Can it be deleted? All the Progress stuff seems only relevant to that. Can it be deleted, too?

I replied to this above. I'm OK with deleting this part for now until we find a real customer for it.

Model:

Methods should start with a verb. TokenToId and IdToToken don't follow the typical guidance.

I'll track fixing that.

Why is GetVocabSize needed? Can't someone just use GetVocab().Count? Seems like that's what all the implementations do.

Inherited from Hugging Face; I'll track removing it.

Why is GetVocab a method rather than a Vocab (or Vocabulary) property?

Inherited from Hugging Face; I'll track fixing it.
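Putting the three Model points above together, a hedged sketch of the cleaned-up surface (the member names are suggestions only, not the shipped API):

```csharp
using System.Collections.Generic;

public abstract class TokenizerModelSketch
{
    // Verb-first replacements for TokenToId / IdToToken:
    public abstract int? MapTokenToId(string token);
    public abstract string? MapIdToToken(int id);

    // A property instead of the GetVocab() method; GetVocabSize() goes away
    // because callers can simply use Vocabulary.Count.
    public abstract IReadOnlyDictionary<string, int> Vocabulary { get; }
}
```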

Are Save and GetTrainer needed?

I already replied regarding the trainer. Save is tied to the trainer; if we delete the trainer, we should delete Save too.

TikToken:

tikTokenBpeFile parameter should be bpeFilePath or something like that. If it's just called file, folks might infer that to mean it's the contents of the file.

I'll track fixing that.

specialTokensEncoder is a strange name. It's just a dictionary and doesn't do any encoding. Should it just be specialTokens?

Could be. I picked this name from the MS tokenizer library.

tikTokenBpeFileStream parameter suggests this must be a "FileStream"... it needn't be. This should just be bpeStream or something like that.

I'll track fixing that.
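A sketch of the constructor parameter renames covered in this section (the type name and signatures here are assumptions for illustration):

```csharp
using System.Collections.Generic;
using System.IO;

public sealed class TiktokenSketch
{
    // Was: Tiktoken(string tikTokenBpeFile, IDictionary<string, int> specialTokensEncoder, ...)
    // "FilePath" makes clear this is a path, not the file's contents, and
    // "specialTokens" drops the misleading "Encoder" suffix.
    public TiktokenSketch(string bpeFilePath,
                          IReadOnlyDictionary<string, int>? specialTokens = null)
    {
        using Stream stream = File.OpenRead(bpeFilePath);
        LoadBpeRanks(stream);
    }

    // Was: Tiktoken(Stream tikTokenBpeFileStream, ...)
    // "bpeStream" avoids implying a FileStream specifically; any readable stream works.
    public TiktokenSketch(Stream bpeStream,
                          IReadOnlyDictionary<string, int>? specialTokens = null)
        => LoadBpeRanks(bpeStream);

    private static void LoadBpeRanks(Stream stream) { /* parsing omitted */ }
}
```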

I suggest we take the surface area of Microsoft.ML.Tokenizers through an API review to get more eyes on it.

I'll do the cleanup first and try to take it to design review.