Discussion about Data Labelling in the Vietnamese language

Thank you very much for this great project.

There is one thing I am confused about.

For instance:

Input sentence: "Giao tôi lê_lai phường hai tân_bình hcm"
Tokenizer output:
{'input_ids': [0, 64003, 64003, 17489, 6115, 64139, 64151, 64003, 6446, 64313, 1340, 74780, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
The tokenizer splits "lê_lai" into ['lê@@', 'l@@', 'ai'], "tân_bình" into ['tân@@', 'bình'], and "hcm" into ['h@@', 'cm'].
The prediction I end up with: ['O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'I-LOC', 'O', 'I-LOC', 'I-LOC', 'O']

The prediction should contain only 7 tags, one per word in the input sentence, but I get more than that because the tags are produced per sub-token. Does this project have a strategy for handling this?
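To make the question concrete, here is a minimal sketch of the post-processing I have in mind: keep only the tag predicted for the first sub-token of each word. It is not taken from NERDA. The function names are hypothetical, the sub-token list assumes the words not mentioned above stay unsplit, and the sub-token tags are made up for illustration.

```python
def word_start_mask(subtokens, continuation="@@"):
    """Return True for every sub-token that starts a new word.

    Assumes BPE-style output where a trailing '@@' means
    'the next sub-token continues this word'.
    """
    mask = []
    previous_continues = False
    for tok in subtokens:
        mask.append(not previous_continues)
        previous_continues = tok.endswith(continuation)
    return mask


def collapse_to_words(subtoken_tags, mask):
    """Keep only the tag of the first sub-token of each word."""
    return [tag for tag, is_start in zip(subtoken_tags, mask) if is_start]


# Sub-tokens for the example sentence (special tokens left out).
subtokens = ["Giao", "tôi", "lê@@", "l@@", "ai", "phường", "hai",
             "tân@@", "bình", "h@@", "cm"]

# Hypothetical per-sub-token predictions, for illustration only.
subtoken_tags = ["O", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC",
                 "I-LOC", "I-LOC", "I-LOC", "I-LOC"]

mask = word_start_mask(subtokens)
word_tags = collapse_to_words(subtoken_tags, mask)
print(len(word_tags), word_tags)
```

With this, the 11 sub-token tags reduce to 7 word-level tags, which is the length I would expect for this sentence.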

According to the HuggingFace documentation:

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert’s vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary. For example, DistilBert’s tokenizer would split the Twitter handle @huggingface into the tokens ['@', 'hugging', '##face']. This is a problem for us because we have exactly one tag per token. If the tokenizer splits a token into multiple sub-tokens, then we will end up with a mismatch between our tokens and our labels.

One way to handle this is to only train on the tag labels for the first subtoken of a split token. We can do this in 🤗 Transformers by setting the labels we wish to ignore to -100. In the example above, if the label for @HuggingFace is 3 (indexing B-corporation), we would set the labels of ['@', 'hugging', '##face'] to [3, -100, -100].
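As I understand the paragraph above, the labelling rule can be sketched in plain Python as follows. This is only my reading of it, not code from HuggingFace or NERDA: `align_labels` and `IGNORE_INDEX` are hypothetical names, and the sub-token lists are written by hand instead of coming from a real tokenizer.

```python
IGNORE_INDEX = -100  # label value that the loss is expected to skip


def align_labels(word_labels, subtokens_per_word):
    """Give each word's label to its first sub-token, IGNORE_INDEX to the rest."""
    aligned = []
    for label, pieces in zip(word_labels, subtokens_per_word):
        aligned.append(label)
        aligned.extend([IGNORE_INDEX] * (len(pieces) - 1))
    return aligned


# The example from the quoted text: '@huggingface' has label 3.
aligned = align_labels([3], [["@", "hugging", "##face"]])
print(aligned)
```

Special tokens are not handled here; they would also need the ignore value.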

Let’s write a function to do this. This is where we will use the offset_mapping from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token’s start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we’re at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].
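Below is a rough sketch of how I read that description, so that it can be checked against what the project does. It is an assumption-laden illustration: `labels_from_offsets` is a hypothetical name, and the offsets are written by hand (relative to the original word, as the quoted text describes) instead of being produced by a tokenizer.

```python
IGNORE_INDEX = -100


def labels_from_offsets(word_labels, offset_mapping):
    """Assign one label per sub-token using (start, end) offsets.

    A sub-token gets the next word label only if it starts at position 0
    of its word and is not empty. Everything else (continuation pieces
    and special tokens with offset (0, 0)) gets IGNORE_INDEX.
    """
    labels = []
    word_index = 0
    for start, end in offset_mapping:
        if start == 0 and end != 0:
            labels.append(word_labels[word_index])
            word_index += 1
        else:
            labels.append(IGNORE_INDEX)
    return labels


# Hand-written offsets for: [CLS] '@' 'hugging' '##face' [SEP]
offsets = [(0, 0), (0, 1), (1, 8), (8, 12), (0, 0)]
labels = labels_from_offsets([3], offsets)
print(labels)
```

In this sketch only the first piece of the word keeps label 3, and both special tokens are ignored.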

I really appreciate your time and help.

ebanalyse / NERDA, issue #19