abhimishra91 / transformers-tutorials

GitHub repo with tutorials to fine-tune transformers for different NLP tasks
MIT License

Tokenization issue in transformer NER #22

Open mukesh-mehta opened 3 years ago

mukesh-mehta commented 3 years ago

In your custom data loader:

import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, tokenizer, sentences, labels, max_len):
        self.len = len(sentences)
        self.sentences = sentences
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        sentence = str(self.sentences[index])
        inputs = self.tokenizer.encode_plus(
            sentence,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        label = self.labels[index]
        # pads the word-level labels to length 200, without accounting
        # for the extra tokens produced by subword splitting above
        label.extend([4] * 200)
        label = label[:200]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'tags': torch.tensor(label, dtype=torch.long)
        } 

    def __len__(self):
        return self.len

According to my understanding: you have a sentence, say `w1 w2 w3 w4`, with BIO labels `O B-class1 I-class1 O`. Once you encode the sentence, the tokenizer applies WordPiece and splits words into subwords, making the token sequence longer than the word sequence, and you then pad it to length 200. Say the tokens become `w1-a w1-b w2 w3-a w3-b w4 [PAD] [PAD] [PAD] [PAD]`, but your labels are still `O B-class1 I-class1 O 4 4 4 4 4 4`. So you are now passing incorrect labels to your model.
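To make the misalignment concrete, here is a minimal self-contained sketch. The `toy_wordpiece` splitter is hypothetical (it just splits any word longer than two characters into two pieces, standing in for the real BERT tokenizer), but it shows how padding word-level labels to the token length, as the dataset class above does, shifts every label after the first split word:

```python
def toy_wordpiece(word):
    # Hypothetical stand-in for WordPiece: split long words into two pieces.
    if len(word) > 2:
        return [word[:2], "##" + word[2:]]
    return [word]

words = ["w1", "longword", "w3", "w4"]
labels = ["O", "B-class1", "I-class1", "O"]

# Subword tokenization makes the token sequence longer than the label sequence.
pieces = [p for w in words for p in toy_wordpiece(w)]
# -> ['w1', 'lo', '##ngword', 'w3', 'w4']

# Naive approach from the dataset class: pad labels to the token length.
naive = (labels + ["PAD"] * len(pieces))[:len(pieces)]
# -> ['O', 'B-class1', 'I-class1', 'O', 'PAD']
# '##ngword' wrongly gets 'I-class1', 'w3' wrongly gets 'O', 'w4' gets 'PAD'.

# Word-aware alignment: keep the label on the first piece of each word,
# mark continuation pieces so the loss can ignore them.
aligned = []
for w, l in zip(words, labels):
    n = len(toy_wordpiece(w))
    aligned.append(l)
    aligned.extend(["IGN"] * (n - 1))
# -> ['O', 'B-class1', 'IGN', 'I-class1', 'O']

print(naive)
print(aligned)
```

The second list keeps every word's label attached to the right tokens, which is the behavior the naive padding loses.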

mukesh-mehta commented 3 years ago

https://github.com/huggingface/transformers/blob/v2.2.2/examples/utils_ner.py#L116 Hugging Face NER dataloader example

mukesh-mehta commented 3 years ago

I have found the correct implementation; you can modify your code accordingly: https://github.com/huggingface/notebooks/blob/master/examples/token_classification.ipynb
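For reference, the approach in the linked notebook relies on a fast tokenizer's `word_ids()` mapping from tokens back to words. A minimal sketch of just the alignment step, with `word_ids` hard-coded here to stand in for the tokenizer output (in the real code it comes from `tokenizer(words, is_split_into_words=True).word_ids()`):

```python
def align_labels(word_ids, labels, ignore_index=-100):
    """Map word-level labels onto subword tokens.

    Special tokens (word id None) and continuation pieces of a word
    get ignore_index so the loss function skips them.
    """
    aligned = []
    prev = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)   # [CLS], [SEP], padding
        elif wid != prev:
            aligned.append(labels[wid])    # first piece of a word
        else:
            aligned.append(ignore_index)   # continuation piece
        prev = wid
    return aligned

# Tokens: [CLS] w1 lo ##ng w2 [SEP] -> word ids per token:
word_ids = [None, 0, 1, 1, 2, None]
labels = [0, 1, 2]  # e.g. label ids for O, B-class1, I-class1
print(align_labels(word_ids, labels))
# -> [-100, 0, 1, -100, 2, -100]
```

`-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so those positions contribute nothing to the loss.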

QuetzalcoatlRosso commented 3 years ago

@mukesh-mehta : could you submit a pull request with your suggested implementation for class CustomDataset?

mukesh-mehta commented 3 years ago

Sure, will do it.