antgr opened this issue 4 years ago
Yeah, the issue is that the words and labels have to exactly align, hence you cannot retokenize the sequences. You have to just feed the sequences, tokenized as they already are, through BERT.
I'll work on a quick example.
Isn't this an issue, given that BERT needs tokenization it understands? Does that mean that many words will be treated as unknown words, even though there is a similar word in its vocab?
Yes, unfortunately words that aren't in BERT's vocab will be converted to unks, and as your tag vocabulary is created from a different corpus than the one the BERT vocabulary was built on, there will be more unks than usual.
I'm just finishing up the pre-trained BERT notebook, will commit in a few minutes.
@antgr Now available here
Some things to note:
The transformers library does have a way to add new tokens to the BERT vocabulary, but they will obviously be initialized randomly and need to be fine-tuned with the full BERT model.
If the full BERT model does not fit on your GPU (it doesn't fit on mine) then you'll have to accept the unks.
Thanks! Great. For the OUTPUT_DIM we include the sos and eos tokens (which are both `<pad>`), and we just ignore them in categorical_accuracy. Right? I also raised a question in the simpletransformers repo (issue 67). The approach there, as described, is the following:
```python
tokens = []
label_ids = []
for word, label in zip(example.words, example.labels):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
    label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
```
> The transformers library does have a way to add new tokens to the BERT vocabulary

I didn't know that.
Yep, the `output_dim` needs to include the `<pad>` token as a valid output - but when the target is a `<pad>` token it is never used to calculate the loss or accuracy, so it shouldn't mess with the training/metrics.
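For concreteness, a minimal sketch of that masking (assuming `TAG_PAD_IDX` is the index of `<pad>` in the tag vocabulary - the names are illustrative, not the tutorial's exact code):

```python
import torch.nn as nn

TAG_PAD_IDX = 0  # assumption: index of the <pad> tag in the tag vocabulary

# positions whose target is <pad> are simply skipped by the loss
criterion = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)

def categorical_accuracy(preds, y, tag_pad_idx=TAG_PAD_IDX):
    """Accuracy over non-<pad> targets only.
    preds: [n_tokens, output_dim], y: [n_tokens]"""
    max_preds = preds.argmax(dim=1)      # predicted tag index per token
    mask = y != tag_pad_idx              # ignore <pad> targets
    correct = (max_preds[mask] == y[mask]).float()
    return correct.sum() / mask.sum().clamp(min=1)
```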
New tokens can be added to the BERT tokenizer with `add_tokens` and then added to the model with `resize_token_embeddings`, see: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.add_tokens
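A minimal sketch of that, assuming a standard `BertTokenizer`/`BertModel` pair (the added words are placeholders):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# add domain-specific words that are missing from the BERT vocab
num_added = tokenizer.add_tokens(['newword1', 'newword2'])

# grow the embedding matrix so the new ids get (randomly initialized) vectors,
# which then need fine-tuning along with the rest of the model
model.resize_token_embeddings(len(tokenizer))
```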
Thanks. Also, for the issue with the size of BERT and with freezing (not training) BERT layers, I have also seen a trade-off like this one: https://github.com/WindChimeRan/pytorch_multi_head_selection_re/blob/30c8aa1b2d89a115612d2c5f344e96ab132220c8/lib/models/selection.py#L60 where only the last layer remains trainable. But maybe even this is too expensive in GPU memory.
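For reference, a rough sketch of that kind of partial freezing with the `transformers` `BertModel` (freeze everything, then unfreeze only the last encoder layer):

```python
from transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

# freeze all BERT parameters...
for param in bert.parameters():
    param.requires_grad = False

# ...then make only the last encoder layer trainable
for param in bert.encoder.layer[-1].parameters():
    param.requires_grad = True
```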
@antgr That's interesting - not seen it before. I'll try it out sometime this month and see if it improves things.
Hey, can we just re-tokenize the text using `BertTokenizer` and align the labels? Suppose our original data is in IOB format, e.g. `words1 B-label1`. If "words1" is tokenized into "word" and "s1", we can align the labels as "B-label1" and "I-label1".
Is there any problem with this kind of approach?
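Something along these lines, as a sketch (assuming IOB tags and word-level input; `tokenizer` is a `BertTokenizer`):

```python
def retokenize_with_labels(words, labels, tokenizer):
    """Split each word with the BERT tokenizer and repeat/convert its IOB label,
    so words and labels stay aligned (B-x on the first subword, I-x on the rest)."""
    new_tokens, new_labels = [], []
    for word, label in zip(words, labels):
        subwords = tokenizer.tokenize(word)
        if not subwords:  # rare edge case: tokenizer returns nothing
            continue
        new_tokens.extend(subwords)
        if label.startswith('B-'):
            inside = 'I-' + label[2:]
            new_labels.extend([label] + [inside] * (len(subwords) - 1))
        else:  # 'I-x' or 'O' just gets repeated
            new_labels.extend([label] * len(subwords))
    return new_tokens, new_labels
```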
I think that should work alright, i.e. if you have "New York" as a NOUN then you can split it into "New" and "York" and should then get NOUN and NOUN.
The real issue is that when doing inference your text should ideally be tokenized with the same tokenizer used to create the dataset. I don't think the tokenizer used to make the dataset is even publicly available, hence we have to use our own tokenizers, which cause weird misalignments.
Can't we just use the same BertTokenizer for inference? I mean, using `BertTokenizer.tokenize(sentence)`? Kind of like what you did in the tutorial:
```python
if isinstance(sentence, str):
    tokens = tokenizer.tokenize(sentence)
else:
    tokens = sentence  # of course we removed this, or: tokens = tokenizer.tokenize(' '.join(sentence))
```
Or did I misunderstand something?
Thanks
Yes, you can, and yes that is what we did.
But ideally, we should tokenize with the same tokenizer used to create the dataset - which is not available, so we have to make do.
Hey, I tried both tutorials on a sequence tagging task on my data, and with the transformer (second tutorial) it seems to be underfitting: the accuracy dropped to around 0.6 (with a simple BiLSTM-CRF I got > 0.85). Any idea what I did wrong?
Is that training, validation or testing accuracy? Your model might be overfitting.
Actually the accuracy was bad for both training and testing (I didn't do validation). For training, the accuracy got stuck at about 0.68, which is why I thought it was underfitting, while for testing the accuracy was about 0.6. I tried changing the learning rate and dropout, but it didn't change much.
Are you using the `bert-base-uncased` model? I think there are a lot of better alternatives available through the `transformers` library now, with models actually trained for NER/POS tagging.
I would try one of the models from: https://huggingface.co/models?pipeline_tag=token-classification
What does your data look like? Is it english? How long are the sentences? How many tags do you have? Are your tags balanced? Is it easy for a human to accurately tag the sentences?
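As a minimal sketch of trying one of those pretrained token-classification models (the model name below is just one example from that page, not a specific recommendation):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# example model from the token-classification page
model_name = 'dslim/bert-base-NER'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# "ner" is the token-classification pipeline
tagger = pipeline('ner', model=model, tokenizer=tokenizer)
print(tagger('George Washington went to Washington'))
```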
Hey, thanks for the response! I actually did try BERT, SciBERT, and also some of the models from your link. The results didn't change much. My data is in English, it is domain-specific, and the sentence length is standard, I guess - not particularly short or long. The tags are not balanced, and even for a human I think it is a difficult task. What makes me wonder is that I got much better accuracy with a simple BiLSTM (about 0.8) than with BERT (0.6).
Btw, it is not a NER task; it is more similar to a Semantic Role Labeling task, where the input is not only the sequence but also a one-hot vector that indicates the position of the predicate in the sentence, e.g.

```
She    | 0 | B-Arg1
likes  | 1 | Predicate
banana | 0 | B-Arg2
.      | 0 | O
```
So basically with BERT I concatenated the BERT embedding and that one-hot vector before the linear layer. Am I doing it wrong?

```python
x = torch.cat((bert_emb, pred), dim=2)
```
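For context, a minimal sketch of that setup (assuming a `transformers` `BertModel`; names and shapes are illustrative, not my exact code):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertSRLTagger(nn.Module):
    """Sketch: concatenate BERT output with a per-token predicate-indicator flag."""
    def __init__(self, n_tags, bert_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.fc = nn.Linear(hidden + 1, n_tags)  # +1 for the 0/1 predicate flag

    def forward(self, input_ids, pred_indicator):
        # input_ids:      [batch, seq_len]
        # pred_indicator: [batch, seq_len] of 0/1 values
        bert_emb = self.bert(input_ids)[0]           # [batch, seq_len, hidden]
        pred = pred_indicator.unsqueeze(-1).float()  # [batch, seq_len, 1]
        x = torch.cat((bert_emb, pred), dim=2)       # concat along the feature dim
        return self.fc(x)                            # [batch, seq_len, n_tags]
```

A common alternative is to pass the 0/1 indicator through a small learned embedding instead of concatenating the raw flag, but that is a design choice rather than something from the tutorials.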
Although there is an example of transformers in another repo (for sentiment analysis) and it is easy to adapt it to other cases, I think sequence tagging is a bit more challenging because the BERT tokenizer splits words into subwords, which breaks the one-to-one alignment between words and labels. Would you consider adding an example with BERT for sequence tagging (e.g. POS tagging)?