NMZivkovic / BertTokenizers

Open source project for BERT Tokenizers in C#.
MIT License

This does not match behavior of Huggingface's Python version #18


gevorgter commented 1 year ago

I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET.

  1. tokenizer.Encode should stop when sequenceLength is reached instead of throwing an exception. It's not always known what the sequence length is going to be.
  2. tokenizer.Encode takes an array of strings. The Python version returns an array of arrays (long[][]) of InputIds. Your version returns a single long[] with the arrays flattened.

Example. Python code:

```python
from transformers import BertTokenizer

MODEL_NAME = "distilbert-base-uncased"

sentence1 = "George Is the best person ever"
sentence2 = "George Is the best person ever"
sentence3 = "George Is the best person ever"
sentences = [sentence1, sentence2, sentence3]

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512)
print(train_encodings)
```

Output:

```
{'input_ids': [[101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
```

C# code:

```csharp
var tokenizer = new BertUncasedBaseTokenizer();
var sentence1 = "George Is the best person ever";
var sentence2 = "George Is the best person ever";
var sentence3 = "George Is the best person ever";
var sentences = new string[] { sentence1, sentence2, sentence3 };
var encoded = tokenizer.Encode(30, sentences);
```

Here typeof(encoded) is System.Collections.Generic.List<(long InputIds, long TokenTypeIds, long AttentionMask)> with Count = 30. I would have expected List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>.
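For comparison, here is a minimal sketch of the batched shape I would expect. EncodeBatch is a hypothetical wrapper, not part of the library; it just calls the existing Encode once per sentence (assuming the BERTTokenizers namespace):

```csharp
using System.Collections.Generic;
using BERTTokenizers; // assumed package namespace

public static class TokenizerExtensions
{
    // Hypothetical helper: keeps one inner list per input sentence,
    // mirroring the long[][] shape of the Hugging Face output above.
    public static List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>> EncodeBatch(
        this BertUncasedBaseTokenizer tokenizer, int sequenceLength, params string[] sentences)
    {
        var batch = new List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>();
        foreach (var sentence in sentences)
        {
            // One Encode call per sentence yields one padded row per sentence.
            batch.Add(tokenizer.Encode(sequenceLength, sentence));
        }
        return batch;
    }
}
```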

gevorgter commented 1 year ago

Also, I believe you are misunderstanding token_type_ids.

From https://huggingface.co/transformers/v3.2.0/glossary.html#token-type-ids:

> The first sequence, the “context” used for the question, has all its tokens represented by a 0, whereas the second sequence, corresponding to the “question”, has all its tokens represented by a 1.
>
> Some models, like XLNetModel use an additional token represented by a 2.

So normally for BERT it's 0 or 1. You are just returning 0, 1, 2, 3, ...: the index of each sentence in the array of strings submitted, which is not what token_type_ids represents. If you look at my Python example above, you will see that the output for token_type_ids is all 0.

gevorgter commented 1 year ago

Point 1 is easy to fix:

```csharp
public List<(long InputId, long TokenTypeId, long AttentionMask)> Encode(int sequenceLength, params string[] texts)
{
    var tokens = Tokenize(texts);

    // Pad up to sequenceLength when the input is short; otherwise no padding.
    List<long> padding;
    if (sequenceLength > tokens.Count)
        padding = Enumerable.Repeat(0L, sequenceLength - tokens.Count).ToList();
    else
        padding = new List<long>();

    // Take(sequenceLength) truncates long inputs instead of throwing.
    var tokenIndexes = tokens.Select(token => (long)token.VocabularyIndex).Concat(padding).Take(sequenceLength).ToArray();
    var segmentIndexes = tokens.Select(token => token.SegmentIndex).Concat(padding).Take(sequenceLength).ToArray();
    var inputMask = tokens.Select(o => 1L).Concat(padding).Take(sequenceLength).ToArray();

    var output = tokenIndexes.Zip(segmentIndexes, Tuple.Create)
        .Zip(inputMask, (t, z) => Tuple.Create(t.Item1, t.Item2, z));

    return output.Select(x => (InputId: x.Item1, TokenTypeId: x.Item2, AttentionMask: x.Item3)).ToList();
}
```
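With that change applied, truncation can be sanity-checked like this (a quick sketch; the 8-token result matches the Python output above):

```csharp
using System;
using BERTTokenizers; // assumed package namespace

var tokenizer = new BertUncasedBaseTokenizer();

// "George Is the best person ever" tokenizes to 8 ids including [CLS] and
// [SEP], so sequenceLength = 8 fits exactly; anything longer than
// sequenceLength would now be cut off instead of throwing.
var encoded = tokenizer.Encode(8, "George Is the best person ever");

foreach (var (inputId, tokenTypeId, attentionMask) in encoded)
    Console.WriteLine($"{inputId} {tokenTypeId} {attentionMask}");
// First column prints 101, 2577, 2003, 1996, 2190, 2711, 2412, 102.
```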
rghavimi commented 1 year ago

In my (limited) experience, the C# Encode method accepts one sentence at a time; I couldn't get it to work with multiple sentences at a time.

Ex:

```csharp
var tokens = tokenizer.Tokenize(sentence);
var encoded = tokenizer.Encode(tokens.Count, sentence);
```

I found this was a helpful resource as well: https://onnxruntime.ai/docs/tutorials/csharp/bert-nlp-csharp-console-app.html

Although I agree with your sentiment, I feel like this one piece is holding me back from really diving into using these models in C#.

gevorgter commented 1 year ago
  1. You could not get it to work with multiple sentences at a time because the presented code produces incorrect output: instead of long[][] it produces a flattened long[] for multiple sentences.

  2. There are 3 types of tasks done with BERT: sentence classification (say, negative/positive/neutral tone), token classification (company name, date, purchase amount, ...), and question answering.

  3. The token_type_ids vector is only needed for question answering. For training, you need to encode 2 sentences: one is the question and the other is the answer. This produces the encoding as one vector (array) covering both sentences, separated by the [SEP] token, but the token_type_ids are 0 for the question tokens and 1 for the answer tokens, so the value never goes above 1. For that task the method should look like Encode(string question, string answer); see the sketch after this list.

Normally you do not even need token_type_ids in C#. They are only needed for training, which is easier done in Python, since you will have a hard time training in C#. Inference is much easier, since you do not need the various loss functions or optimizers, and you want it to be fast, so you are probably better off running it with C# than Python.

  4. BERT has a limit of 512 tokens, so tokenizer.Encode(tokens.Count, sentence); should be tokenizer.Encode((tokens.Count > 512) ? 512 : tokens.Count, sentence);
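To illustrate point 3, here is a rough sketch of what such a pair-encoding method could look like. Nothing here is the library's API: the tokenizeToIds helper, the special-token ids, and all names are stand-ins, just to show the [CLS] question [SEP] answer [SEP] layout and the 0/1 token_type_ids pattern:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PairEncodingSketch
{
    public static (long[] InputIds, long[] TokenTypeIds, long[] AttentionMask) EncodePair(
        Func<string, long[]> tokenizeToIds, // hypothetical helper: text -> vocabulary ids
        long clsId, long sepId,
        string question, string answer, int maxLength = 512)
    {
        // Layout: [CLS] question [SEP] answer [SEP]
        var ids = new List<long> { clsId };
        ids.AddRange(tokenizeToIds(question));
        ids.Add(sepId);
        var questionLength = ids.Count; // everything so far belongs to segment 0

        ids.AddRange(tokenizeToIds(answer));
        ids.Add(sepId); // the answer and its trailing [SEP] belong to segment 1

        // BERT's hard limit of 512 tokens (point 4): truncate, don't throw.
        if (ids.Count > maxLength)
            ids = ids.Take(maxLength).ToList();

        var tokenTypeIds = ids.Select((_, i) => i < questionLength ? 0L : 1L).ToArray();
        var attentionMask = Enumerable.Repeat(1L, ids.Count).ToArray();
        return (ids.ToArray(), tokenTypeIds, attentionMask);
    }
}
```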
rghavimi commented 1 year ago

Not sure if you've experienced this as well, but the Tokenize method also hangs when presented with irregular text. For example, this input causes Tokenize to hang and ultimately produces an OOM exception:

वधनसभ चनव नतज 2023भकपवष‍ण दवभरतय रलबगश‍वर धमपकस‍तन आरथक सकटबजनससरकर नकररस-यकरन सकट

Obviously an extreme example, but the same occurs for English text with accent marks (think someone's name), etc.

Currently trying to find a way to address this.

rghavimi commented 1 year ago

Made some improvements to the TokenizeSubwords method in TokenizerBase to improve general resiliency. With this version, every pass through the outer loop either consumes at least one character of the word or yields [UNK] and stops, so it can no longer spin forever on input it cannot encode:

```csharp
private IEnumerable<(string Token, int VocabularyIndex)> TokenizeSubwords(string word)
{
    // Fast path: the whole word is in the vocabulary.
    if (this.vocabularyDict.TryGetValue(word, out var count))
    {
        yield return (word, count);
        yield break;
    }

    var remaining = word;
    var iterations = 0;

    while (!string.IsNullOrEmpty(remaining) && remaining.Length > 2)
    {
        // Find the longest prefix of `remaining` that is in the vocabulary.
        // After the first iteration `remaining` starts with "##", so the
        // minimum useful length is 3 ("##" plus one character).
        string prefix = null;
        var subwordLength = remaining.Length;
        while (subwordLength >= (iterations == 0 ? 1 : 3))
        {
            var subword = remaining.Substring(0, subwordLength);
            if (!this.vocabularyDict.ContainsKey(subword))
            {
                subwordLength--;
                continue;
            }

            prefix = subword;
            break;
        }

        // No prefix matched at all: emit [UNK] and stop instead of looping.
        if (prefix == null)
        {
            yield return (Tokens.Unknown, this.vocabularyDict[Tokens.Unknown]);
            yield break;
        }

        // Rewrite the leftover characters as a "##"-prefixed continuation.
        remaining = new StringBuilder("##").Append(remaining, prefix.Length, remaining.Length - prefix.Length).ToString();

        yield return (prefix, this.vocabularyDict[prefix]);
        iterations++;
    }

    // Nothing could be tokenized: fall back to a single [UNK].
    if (!string.IsNullOrWhiteSpace(word) && iterations == 0)
    {
        yield return (Tokens.Unknown, this.vocabularyDict[Tokens.Unknown]);
    }
}
```

Will try to get an MR opened up for this. Are you looking into fixing this issue as well @gevorgter?

> tokenizer.Encode takes an array of strings. The Python version returns an array of arrays (long[][]) of InputIds. Your version returns a single long[] with the arrays flattened.

ctwardy commented 1 year ago

What's the status on these? Is there a PR or a fork with the changes?

I'm working with the NuGet package. After a fair bit of wrapper work, I got it generating 384-dimensional vectors from the ONNX export of the all-MiniLM-L6-v2 model. However, the vectors do not match the ones from Python. Wondering if it's something in the tokenizer, ONNX, or my code.

georg-jung commented 11 months ago

@ctwardy

> Is there a PR or a fork with the changes?

Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (nuget). I invested in ensuring that the encoding results of FastBertTokenizer match those of Hugging Face's Python and Rust tokenizers in all practical cases. Unit tests verify this against a 15k-document Wikipedia corpus that includes right-to-left languages, exotic scripts, and more. Aside from edge cases around characters that cannot be encoded using a given vocabulary (e.g., when encountering multiple Assamese characters in a row, my implementation sometimes emits an [UNK] token more often than Hugging Face would; conversely, Hugging Face sometimes emits [UNK] for words in Greek letters even though they can be encoded using the given vocabulary, which my implementation does), I'm able to achieve equal encodings. If you give it a try and stumble upon an issue, please let me know.

Note that while my implementation does support batched encoding, it does not return a List<List<(InputIds, AttentionMask, TokenTypeIds)>> or something like that, because that would be rather inefficient and would probably require you to copy the ids in memory for further processing with e.g. ONNX Runtime. My implementation returns (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds), which you can e.g. feed directly as a batched model input to many ONNX models; you just need to set the right shape.
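For instance, here is a rough sketch of feeding such a tuple to Microsoft.ML.OnnxRuntime. The input names ("input_ids" etc.) and the batchSize/seqLen shape are assumptions that depend on the exported model:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public static class OnnxFeedSketch
{
    public static IDisposableReadOnlyCollection<DisposableNamedOnnxValue> Run(
        InferenceSession session,
        ReadOnlyMemory<long> inputIds,
        ReadOnlyMemory<long> attentionMask,
        ReadOnlyMemory<long> tokenTypeIds,
        int batchSize, int seqLen)
    {
        var shape = new[] { batchSize, seqLen };

        // ToArray() copies the buffers; good enough for a sketch, though a
        // real implementation could avoid the copy.
        var inputs = new List<NamedOnnxValue>
        {
            NamedOnnxValue.CreateFromTensor("input_ids",
                new DenseTensor<long>(inputIds.ToArray(), shape)),
            NamedOnnxValue.CreateFromTensor("attention_mask",
                new DenseTensor<long>(attentionMask.ToArray(), shape)),
            NamedOnnxValue.CreateFromTensor("token_type_ids",
                new DenseTensor<long>(tokenTypeIds.ToArray(), shape)),
        };

        return session.Run(inputs);
    }
}
```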

ctwardy commented 11 months ago

> Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (nuget).

Nice. I had used a Hugging Face microservice instead, but your package should help the C# ML ecosystem.