gevorgter opened 1 year ago
Also, I believe you are misunderstanding token_type_ids.
From https://huggingface.co/transformers/v3.2.0/glossary.html#token-type-ids: "The first sequence, the “context” used for the question, has all its tokens represented by a 0, whereas the second sequence, corresponding to the “question”, has all its tokens represented by a 1. Some models, like XLNetModel use an additional token represented by a 2."
So normally for BERT it's 0 or 1. You are just returning the 0, 1, 2, 3, ... index of each string in the array submitted, which is not what token_type_ids represents. If you look at my Python example in this thread you will see the output is all 0s for token_type_ids.
Easy to fix #1
```csharp
public List<(long InputId, long TokenTypeId, long AttentionMask)> Encode(int sequenceLength, params string[] texts)
{
    var tokens = Tokenize(texts);

    List<long> padding;
    if (sequenceLength > tokens.Count)
        padding = Enumerable.Repeat(0L, sequenceLength - tokens.Count).ToList();
    else
        padding = new List<long>();

    var tokenIndexes = tokens.Select(token => (long)token.VocabularyIndex).Concat(padding).Take(sequenceLength).ToArray();
    var segmentIndexes = tokens.Select(token => token.SegmentIndex).Concat(padding).Take(sequenceLength).ToArray();
    var inputMask = tokens.Select(o => 1L).Concat(padding).Take(sequenceLength).ToArray();

    var output = tokenIndexes.Zip(segmentIndexes, Tuple.Create)
        .Zip(inputMask, (t, z) => Tuple.Create(t.Item1, t.Item2, z));

    return output.Select(x => (InputId: x.Item1, TokenTypeId: x.Item2, AttentionMask: x.Item3)).ToList();
}
```
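For illustration, here is a minimal, untested sketch of what I mean (same class context and Tokenize as above; the method name is illustrative): for a single input sequence, every token's type id would simply be 0 instead of the index of the string in texts.

```csharp
// Illustrative sketch only, not a tested patch: encode ONE sentence with
// token_type_ids fixed at 0, which is what BERT expects for single-sequence tasks.
public List<(long InputId, long TokenTypeId, long AttentionMask)> EncodeSingle(int sequenceLength, string text)
{
    var tokens = Tokenize(text);                                                  // same Tokenize as above
    var padding = Enumerable.Repeat(0L, Math.Max(0, sequenceLength - tokens.Count));

    var inputIds = tokens.Select(t => (long)t.VocabularyIndex).Concat(padding).Take(sequenceLength);
    var typeIds = tokens.Select(_ => 0L).Concat(padding).Take(sequenceLength);    // always 0, not the string index
    var mask = tokens.Select(_ => 1L).Concat(padding).Take(sequenceLength);       // 1 = real token, 0 = padding

    return inputIds.Zip(typeIds, (id, tt) => (id, tt))
                   .Zip(mask, (p, m) => (InputId: p.id, TokenTypeId: p.tt, AttentionMask: m))
                   .ToList();
}
```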
So in my (limited) experience, the C# version's Encode method accepts one sentence at a time; I couldn't get it to work with multiple sentences at a time.
Ex: `var tokens = tokenizer.Tokenize(sentence); var encoded = tokenizer.Encode(tokens.Count, sentence);`
I found this was a helpful resource as well: https://onnxruntime.ai/docs/tutorials/csharp/bert-nlp-csharp-console-app.html
Although I agree with your sentiment, I feel like this one piece is holding me back from really diving into using the models in C#.
You could not get it to work with multiple sentences at a time because the presented code produces incorrect output: for multiple sentences it produces a single flattened int[] instead of an int[][].
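As a workaround (just a sketch, assuming the Tokenize/Encode shown earlier in this thread and a string[] called sentences), you can get the jagged, one-row-per-sentence shape by encoding each sentence separately:

```csharp
// Sketch of a workaround: call Encode once per sentence so each sentence gets its own
// row, i.e. the jagged structure the Python tokenizer returns, instead of one flat list.
var perSentence = new List<List<(long InputId, long TokenTypeId, long AttentionMask)>>();
foreach (var sentence in sentences)
{
    var tokens = tokenizer.Tokenize(sentence);
    perSentence.Add(tokenizer.Encode(tokens.Count, sentence));
}
```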
There are three types of tasks commonly done with BERT: sentence classification (say, negative/positive/neutral tone), token classification (company name, date, purchase amount, ...), and question answering.
The token_type_ids vector is only needed for question answering. For training you need to encode two sentences: one is the question and the other is the answer. This produces the encoding as one vector (array) covering both sentences, separated by the [SEP] token, but token_type_ids is 0 for the question tokens and 1 for the answer tokens, so it never goes above 1. For that task the Encode method should look like Encode(string question, string answer).
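To make that concrete, here is a rough, purely illustrative sketch (the TokenizeWordsToIds helper and the vocabularyDict lookups are assumptions, not the library's actual API; [CLS] and [SEP] are the standard BERT special tokens):

```csharp
// Purely illustrative sketch of pair encoding for question answering:
//   [CLS] question tokens [SEP] answer tokens [SEP]
// token_type_ids are 0 up to and including the first [SEP], and 1 afterwards.
public (long[] InputIds, long[] TokenTypeIds, long[] AttentionMask) Encode(string question, string answer)
{
    long cls = this.vocabularyDict["[CLS]"];   // assumed vocabulary lookups
    long sep = this.vocabularyDict["[SEP]"];

    var q = TokenizeWordsToIds(question);      // assumed helper: vocabulary ids only, no special tokens
    var a = TokenizeWordsToIds(answer);        // assumed helper: vocabulary ids only, no special tokens

    var inputIds = new List<long> { cls };
    inputIds.AddRange(q);
    inputIds.Add(sep);
    int firstSegmentLength = inputIds.Count;   // everything so far belongs to segment 0
    inputIds.AddRange(a);
    inputIds.Add(sep);

    var tokenTypeIds = Enumerable.Repeat(0L, firstSegmentLength)
        .Concat(Enumerable.Repeat(1L, inputIds.Count - firstSegmentLength))
        .ToArray();
    var attentionMask = Enumerable.Repeat(1L, inputIds.Count).ToArray();

    return (inputIds.ToArray(), tokenTypeIds, attentionMask);
}
```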
Normally you do not even need token_type_ids in C#. They are only needed for training, and training is more easily done in Python, since you will have a hard time training in C#. Inference is much easier because you do not need the various loss functions or optimizers, and you want it to be fast, so you are probably better off running it in C# than in Python.
Not sure if you've experienced this as well, but the Tokenize method also hangs when presented with irregular text. For example, this input will cause the Tokenize method to hang and ultimately produce an OOM exception:
वधनसभ चनव नतज 2023भकपवषण दवभरतय रलबगशवर धमपकसतन आरथक सकटबजनससरकर नकररस-यकरन सकट
Obviously an extreme example, but the same occurs for English text with accent marks (think someone's name), etc.
Currently trying to find a way to address this.
Made some changes to the TokenizeSubwords method in TokenizerBase to improve its general resiliency:
```csharp
private IEnumerable<(string Token, int VocabularyIndex)> TokenizeSubwords(string word)
{
    // Fast path: the whole word is already in the vocabulary.
    if (this.vocabularyDict.TryGetValue(word, out var count))
    {
        yield return (word, count);
        yield break;
    }

    var remaining = word;
    var iterations = 0;
    while (!string.IsNullOrEmpty(remaining) && remaining.Length > 2)
    {
        // Greedily find the longest prefix of `remaining` that exists in the vocabulary.
        // After the first iteration `remaining` starts with "##", so require at least
        // 3 characters there: a match must cover at least one real character beyond
        // the "##" continuation marker.
        string prefix = null;
        var subwordLength = remaining.Length;
        while (subwordLength >= (iterations == 0 ? 1 : 3))
        {
            var subword = remaining.Substring(0, subwordLength);
            if (!this.vocabularyDict.ContainsKey(subword))
            {
                subwordLength--;
                continue;
            }

            prefix = subword;
            break;
        }

        // No prefix of any length is in the vocabulary: give up on this word.
        if (prefix == null)
        {
            yield return (Tokens.Unknown, this.vocabularyDict[Tokens.Unknown]);
            yield break;
        }

        // Emit the matched piece and continue with the rest of the word, marked as a
        // "##" continuation in WordPiece fashion.
        remaining = new StringBuilder("##").Append(remaining, prefix.Length, remaining.Length - prefix.Length).ToString();
        yield return (prefix, this.vocabularyDict[prefix]);
        iterations++;
    }

    // Nothing could be tokenized at all: emit a single [UNK] for the word.
    if (!string.IsNullOrWhiteSpace(word) && iterations == 0)
    {
        yield return (Tokens.Unknown, this.vocabularyDict[Tokens.Unknown]);
    }
}
```
Will try to get an MR opened up for this. Are you looking into fixing this issue as well, @gevorgter?
tokenizer.Encode takes an array of strings. The Python version returns an array of arrays (long[][]) of InputIds; your version returns a single long[] with the arrays flattened.
What's the status on these? Is there a PR or a fork with the changes?
I'm working with the NuGet package. After a fair bit of wrapper work, I got it generating 384-dimensional vectors from the ONNX export of the all-MiniLM-L6-v2 model. However, the vectors do not match the ones produced in Python. I'm wondering whether the difference is in the tokenizer, in ONNX, or in my code.
@ctwardy

> Is there a PR or a fork with the changes?
Inspired by this library (thanks for the great work @NMZivkovic!), I built FastBertTokenizer (nuget). I invested in ensuring that the encoding results of FastBertTokenizer match those of Hugging Face's Python and Rust tokenizers in all practical cases; unit tests verify this against a 15k-document Wikipedia corpus which includes right-to-left languages, exotic scripts and more. Aside from edge cases around characters that cannot be encoded using a given dictionary (e.g. when encountering multiple Assamese characters in a row, my implementation sometimes emits more [UNK] tokens than Hugging Face would; conversely, Hugging Face sometimes emits [UNK] for words in Greek letters even though it is possible to encode them using the given vocabulary, which my implementation does), I'm able to achieve equal encodings. If you give it a try and stumble upon an issue, please let me know.
Note that while my implementation does support batched encoding, it does not return a List<List<(InputIds, AttentionMask, TokenTypeIds)>> or something like that, because that would be rather inefficient and would probably require you to copy the ids in memory for further processing with e.g. ONNX Runtime. My implementation returns (ReadOnlyMemory<long> InputIds, ReadOnlyMemory<long> AttentionMask, ReadOnlyMemory<long> TokenTypeIds), which you can feed directly as a batched model input to many ONNX models; you just need to set the right shape.
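For example, here is a rough sketch of the ONNX Runtime side using Microsoft.ML.OnnxRuntime (the input names input_ids/attention_mask/token_type_ids, the model path, and the id values are assumptions for illustration; the flat buffers stand in for whatever your tokenizer produced): a batch is just a flat, row-major buffer wrapped in a DenseTensor with shape [batchSize, sequenceLength].

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Rough sketch: 3 identical sentences of 8 tokens each, stored row-major in flat buffers.
// In practice these buffers would come from your tokenizer's output.
long[] oneRow = { 101, 2577, 2003, 1996, 2190, 2711, 2412, 102 };          // illustrative ids
long[] inputIds = oneRow.Concat(oneRow).Concat(oneRow).ToArray();
long[] attentionMask = Enumerable.Repeat(1L, inputIds.Length).ToArray();
long[] tokenTypeIds = new long[inputIds.Length];                            // all zeros

int batchSize = 3, sequenceLength = 8;
var shape = new[] { batchSize, sequenceLength };

var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input_ids",      new DenseTensor<long>(inputIds, shape)),
    NamedOnnxValue.CreateFromTensor("attention_mask", new DenseTensor<long>(attentionMask, shape)),
    NamedOnnxValue.CreateFromTensor("token_type_ids", new DenseTensor<long>(tokenTypeIds, shape)),
};

using var session = new InferenceSession("model.onnx");   // path is illustrative
using var results = session.Run(inputs);                   // run inference on the whole batch at once
```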
Nice. I had used a HuggingFace micro-service instead, but your package should help the C# ML ecosystem.
I would expect the tokenizer's behavior to match the Python version; otherwise it will be hard to convert samples from Python to .NET.
Example. Python code:

```python
from transformers import BertTokenizer

MODEL_NAME = "distilbert-base-uncased"

sentence1 = "George Is the best person ever"
sentence2 = "George Is the best person ever"
sentence3 = "George Is the best person ever"
sentences = [sentence1, sentence2, sentence3]

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
train_encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512)
print(train_encodings)
```

Output:

```
{'input_ids': [[101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102], [101, 2577, 2003, 1996, 2190, 2711, 2412, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}
```
C# code:

```csharp
var tokenizer = new BertUncasedBaseTokenizer();
var sentence1 = "George Is the best persone ever";
var sentence2 = "George Is the best persone ever";
var sentence3 = "George Is the best persone ever";
var sentenses = new string[] { sentence1, sentence2, sentence3 };
var encoded = tokenizer.Encode(30, sentenses);
```

typeof(encoded) = System.Collections.Generic.List<(long InputIds, long TokenTypeIds, long AttentionMask)> (Count = 30). I would have expected List<List<(long InputIds, long TokenTypeIds, long AttentionMask)>>.