NMZivkovic / BertTokenizers

Open source project for BERT Tokenizers in C#.
MIT License

different behavior: hugging face bert-base-uncased vs. BERT Base Uncased #12

Closed: PaulCalot closed this issue 2 years ago

PaulCalot commented 2 years ago

Hello there,

First, thanks for the very useful package.

I noticed a different behavior that is reproducible with the word "eiffel". Using the HuggingFace bert-base-uncased tokenizer, the tokenization yields 'e', '##iff', '##el'. However, when using the BERT base uncased tokenizer of this package, I simply get '[UNK]'.

Since I am not sure whether you intended it to work this way, I decided to open an issue.

To solve it, I changed a few lines in TokenizerBase.cs, basically allowing subwords of length 1 and replacing only the first occurrence of the subword in the word with '##' (which should be done in any case).

private IEnumerable<(string Token, int VocabularyIndex)> TokenizeSubwords(string word)
    {
        if (_vocabularyDict.ContainsKey(word))
        {
            return new (string, int)[] { (word, _vocabularyDict[word]) };
        }

        var tokens = new List<(string, int)>();
        var remaining = word;

        while (!string.IsNullOrEmpty(remaining) && remaining.Length > 2)
        {
            string prefix = null;
            int subwordLength = remaining.Length;
            while (subwordLength >= 1) // was initially 2, which prevents using "character encoding"
            {
                string subword = remaining.Substring(0, subwordLength);
                if (!_vocabularyDict.ContainsKey(subword))
                {
                    subwordLength--;
                    continue;
                }

                prefix = subword;
                break;
            }

            if (prefix == null)
            {
                tokens.Add((Tokens.Unknown, _vocabularyDict[Tokens.Unknown]));

                return tokens;
            }

        var regex = new Regex(Regex.Escape(prefix)); // escape the prefix: vocab entries such as "." or "+" are regex metacharacters
            remaining = regex.Replace(remaining, "##", 1);
            // remaining = remaining.Replace(prefix, "##"); // should only replace the first occurence, not all of them

            tokens.Add((prefix, _vocabularyDict[prefix]));
        }

        if (!string.IsNullOrWhiteSpace(word) && !tokens.Any())
        {
            tokens.Add((Tokens.Unknown, _vocabularyDict[Tokens.Unknown]));
        }

        return tokens;
    }
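
For comparison, HuggingFace's WordPiece tokenizer does greedy longest-match-first using an index into the word, rather than rewriting the remaining string. A minimal Python sketch of that approach (toy vocabulary assumed, not the real ~30k-entry BERT vocab):

```python
# Sketch of greedy longest-match-first (WordPiece) tokenization.
# An index into the word replaces the "substitute matched text with ##"
# step, so the remaining input always shrinks and cannot loop.

def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    if word in vocab:
        return [word]
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no subword matched at this position: whole word is unknown
        tokens.append(match)
        start = end
    return tokens

vocab = {"e", "##iff", "##el"}
print(wordpiece_tokenize("eiffel", vocab))  # ['e', '##iff', '##el']
```

Tracking `start` instead of mutating the string also makes the "replace only the first occurrence" concern disappear entirely.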

For the ~200 sentences I tested and compared against the HF version, it worked as expected; however, I did not run any more thorough tests.

Cheers,

Paul

NMZivkovic commented 2 years ago

Thank you so much! This will fix a lot of other problems that were reported.

NMZivkovic commented 2 years ago

Hey, thanks again for your input! This fix is now part of 1.2.0. I will close this ticket, and we can open a new one if there are any problems.

Cheers

theolivenbaum commented 1 year ago

It seems like this change introduced an infinite loop when trying to tokenize some sentences, for example "El Patrón Repositorio y sus falacias".
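
A plausible mechanism for the hang (an assumption, not verified against the 1.2.0 source): bert-base-uncased's vocabulary contains "#" as a standalone token, so once subwords of length 1 are allowed, a leftover string like "##o" (produced after matching part of a word such as "patrón") can match the prefix "#". Replacing that first "#" with "##" then makes the string longer instead of shorter, so the outer loop never terminates. A bounded Python reproduction of that behavior:

```python
import re

# Hypothetical reproduction of the reported hang. Assumption: "#" is in
# the vocabulary. With minimum subword length 1 (the 1.2.0 change), the
# greedy match picks "#", and replacing it with "##" GROWS the string,
# so the real while-loop in TokenizeSubwords never makes progress.

vocab = {"#", "el"}   # toy vocab; "#" is the problematic entry
remaining = "##o"     # leftover input after earlier matches

for step in range(5):  # bounded loop in place of the real while-loop
    # greedy longest-match-first, minimum length 1
    prefix = next(
        (remaining[:k] for k in range(len(remaining), 0, -1)
         if remaining[:k] in vocab),
        None,
    )
    if prefix is None:
        break
    remaining = re.sub(re.escape(prefix), "##", remaining, count=1)
    print(step, repr(remaining))  # the string gets longer every iteration
```

If that is the cause, the standard fix is to track an index into the word (as HF's WordPiece does) instead of rewriting the remaining string.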