Closed luisquintanilla closed 1 month ago
@luisquintanilla we are doing the same thing Hugging Face does with its whitespace pre-tokenizer:
https://github.com/dotnet/machinelearning/blob/8e3f72d0239d74d5c3cd681e7027edc65c18f1a0/src/Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs#L22
https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/pre_tokenizers/whitespace.rs#L21
Do you still want to change that?
I see. Thanks for clarifying. If we're consistent I'd leave it. I know you're working on adding special tokens.
Is that happening before or after pretokenization?
I think that's where discrepancies may lie.
When using the WhiteSpace pretokenizer, special tokens such as [CLS] get broken up, as shown in the code example. I'd think that [CLS] should be considered a single token, even after pretokenization.
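To make the before/after question concrete, here is a minimal Python sketch, not the Microsoft.ML.Tokenizers API. It assumes the pattern at the linked lines is `\w+|[^\w\s]+`; the special-token list and both function names are hypothetical and exist only for illustration.

```python
import re

# Assumed pattern: runs of word characters, or runs of characters that are
# neither word characters nor whitespace.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

# Hypothetical special-token list, for illustration only.
SPECIAL_TOKENS = ["[CLS]", "[SEP]"]


def pre_tokenize(text):
    """Apply the pattern directly, with no special-token handling."""
    return WHITESPACE_PATTERN.findall(text)


def pre_tokenize_with_special_tokens(text, special_tokens=SPECIAL_TOKENS):
    """Carve out special tokens first, then pre-tokenize the remaining text."""
    # The capturing group makes re.split keep the matched special tokens.
    splitter = re.compile("(" + "|".join(re.escape(t) for t in special_tokens) + ")")
    pieces = []
    for chunk in splitter.split(text):
        if chunk in special_tokens:
            pieces.append(chunk)  # keep the special token whole
        else:
            pieces.extend(WHITESPACE_PATTERN.findall(chunk))
    return pieces


print(pre_tokenize("[CLS] Hello, world! [SEP]"))
print(pre_tokenize_with_special_tokens("[CLS] Hello, world! [SEP]"))
```

In the first call the brackets are split off from `CLS`; in the second, `[CLS]` and `[SEP]` survive as single pieces because they are matched before the pattern ever sees them.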
@luisquintanilla it seems like we are consistent with huggingface here:
Were you noticing a problem that you thought was caused by this (in which case we should try to diagnose that) or were you just calling the API directly and expecting it to behave differently?
Thanks for investigating.
I was under the impression that [CLS] would be treated differently since it's a special token.
I think we can close this one since we're consistent.
Given the following code:
The WhiteSpace tokenizer is splitting based on non-alphanumeric and whitespace characters.
Output:
Based on the name, I would expect it to split only on whitespace.
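The difference between the two behaviors can be sketched in a few lines of Python. This is only an illustration, assuming the pattern at the linked lines is `\w+|[^\w\s]+`; the function names are made up and are not part of either library. (Hugging Face ships a separate `WhitespaceSplit` pre-tokenizer for the whitespace-only behavior.)

```python
import re

# Assumed pattern used by the WhiteSpace pre-tokenizer.
WORD_OR_PUNCT = re.compile(r"\w+|[^\w\s]+")


def split_words_and_punctuation(text):
    """Split into runs of word characters and runs of punctuation."""
    return WORD_OR_PUNCT.findall(text)


def split_on_whitespace_only(text):
    """Split on whitespace only, leaving punctuation attached to words."""
    return text.split()


sample = "Hello, world! It's me."
print(split_words_and_punctuation(sample))
print(split_on_whitespace_only(sample))
```

With the assumed pattern, `Hello,` becomes two pieces (`Hello` and `,`) and `It's` becomes three, whereas the whitespace-only split keeps each of them together.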