dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License
9.01k stars 1.88k forks source link

[Tokenizers] WhiteSpace pretokenizer not only splitting on whitepace #7183

Closed luisquintanilla closed 1 month ago

luisquintanilla commented 3 months ago

Given the following code:

using Microsoft.ML.Tokenizers;

ReadOnlySpan<char> sentence = "[CLS] I love AI [SEP]";

var pretokenizer = new WhiteSpace();

var tokens = pretokenizer.PreTokenize(sentence);

// Actual
Console.WriteLine("Actual");
foreach(var result in tokens)
{
    var substr = sentence.Slice(result.Offset, result.Length);
    Console.WriteLine(substr.ToString());
}

Console.WriteLine("----");

// Expected
Console.WriteLine("Expected");
foreach(var expected in sentence.ToString().Split(' '))
{
    Console.WriteLine(expected);
}

The WhiteSpace tokenizer is splitting based on non-alphanumeric and whitespace characters.

Output:

Actual:
[
CLS
]
I
love
AI
[
SEP
]
----
Expected:
[CLS]
I
love
AI
[SEP]

I would expect it to only split on whitespace based on the name.

tarekgh commented 3 months ago

@luisquintanilla we are doing as what Huggingface doing with the white space pre-tokenizers.

https://github.com/dotnet/machinelearning/blob/8e3f72d0239d74d5c3cd681e7027edc65c18f1a0/src/Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs#L22 https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/pre_tokenizers/whitespace.rs#L21

Do you still want to change that?

luisquintanilla commented 2 months ago

@luisquintanilla we are doing as what Huggingface doing with the white space pre-tokenizers.

https://github.com/dotnet/machinelearning/blob/8e3f72d0239d74d5c3cd681e7027edc65c18f1a0/src/Microsoft.ML.Tokenizers/PreTokenizer/Whitespace.cs#L22

https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/pre_tokenizers/whitespace.rs#L21 Do you still want to change that?

I see. Thanks for clarifying. If we're consistent I'd leave it. I know you're working on adding special tokens.

Is that happening before or after pretokenization?

I think that's where discrepancies may lie.

When using WhiteSpace pretokenizer, the special tokens [CLS] for example get broken up as shown in the code example. I'd think that [CLS] should be considered a single token, even after pretokenization.

ericstj commented 1 month ago

@luisquintanilla it seems like we are consistent with huggingface here: image

Were you noticing a problem that you thought was caused by this (in which case we should try to diagnose that) or were you just calling the API directly and expecting it to behave differently?

luisquintanilla commented 1 month ago

Thanks for investigating.

I was under the impression that [CLS] would be treated differently since it's a special token.

I think we can close this one since we're consistent.