antgr opened this issue 4 years ago
Yeah, the issue is that the words and labels have to exactly align, hence you cannot retokenize the sequences. You have to just feed the sequences, tokenized as they already are, through BERT.
I'll work on a quick example.
Isn't this an issue, given that BERT needs tokenization it understands? Does that mean that many words will be treated as unknown words, even though there is a similar word in its vocab?
Yes, unfortunately words that aren't in BERT's vocab will be converted to unks, and as your tag vocabulary is created from a different corpus than the one the BERT vocabulary was built on, there will be more unks than usual.
I'm just finishing up the pre-trained BERT notebook, will commit in a few minutes.
@antgr Now available here
Some things to note:
The transformers library does have a way to add new tokens to the BERT vocabulary, but they will obviously be initialized randomly and need to be fine-tuned with the full BERT model.
If the full BERT model does not fit on your GPU (it doesn't fit on mine) then you'll have to accept the unks.
Thanks! Great. For the OUTPUT_DIM we include the sos and eos tokens (which are both `<pad>`), and we just ignore them in categorical_accuracy. Right? I also raised a question in the simpletransformers repo (issue 67). The approach there, as described, is the following:
```python
tokens = []
label_ids = []
for word, label in zip(example.words, example.labels):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
    label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
```
> The transformers library does have a way to add new tokens to the BERT vocabulary

I didn't know that.
Yep, the `output_dim` needs to include the `<pad>` token as a valid output - but when the target is a `<pad>` token it is never used to calculate the loss or accuracy, so it shouldn't mess with the training/metrics.
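For concreteness, a minimal sketch of that masking (assuming `TAG_PAD_IDX` is the index of `<pad>` in the tag vocabulary - the names are illustrative, not the tutorial's exact code):

```python
import torch.nn as nn

TAG_PAD_IDX = 0  # assumption: index of the <pad> tag in the tag vocabulary

# positions whose target is <pad> are simply skipped by the loss
criterion = nn.CrossEntropyLoss(ignore_index=TAG_PAD_IDX)

def categorical_accuracy(preds, y, tag_pad_idx=TAG_PAD_IDX):
    """Accuracy over non-<pad> targets only.
    preds: [n_tokens, output_dim], y: [n_tokens]"""
    max_preds = preds.argmax(dim=1)      # predicted tag index per token
    mask = y != tag_pad_idx              # ignore <pad> targets
    correct = (max_preds[mask] == y[mask]).float()
    return correct.sum() / mask.sum().clamp(min=1)
```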
New tokens can be added to the BERT tokenizer with `add_tokens` and then added to the model with `resize_token_embeddings`, see: https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer.add_tokens
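A minimal sketch of that, assuming a standard `BertTokenizer`/`BertModel` pair (the added words are placeholders):

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# add domain-specific words that are missing from the BERT vocab
num_added = tokenizer.add_tokens(['newword1', 'newword2'])

# grow the embedding matrix so the new ids get (randomly initialized) vectors,
# which then need fine-tuning along with the rest of the model
model.resize_token_embeddings(len(tokenizer))
```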
Thanks. Also, for the issue with the size of BERT and with freezing (not training) BERT layers, I have also seen a trade-off like this one: https://github.com/WindChimeRan/pytorch_multi_head_selection_re/blob/30c8aa1b2d89a115612d2c5f344e96ab132220c8/lib/models/selection.py#L60 where only the last layer remains trainable. But maybe even this is too expensive in GPU memory.
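For reference, a rough sketch of that kind of partial freezing with the `transformers` `BertModel` (freeze everything, then unfreeze only the last encoder layer):

```python
from transformers import BertModel

bert = BertModel.from_pretrained('bert-base-uncased')

# freeze all BERT parameters...
for param in bert.parameters():
    param.requires_grad = False

# ...then make only the last encoder layer trainable
for param in bert.encoder.layer[-1].parameters():
    param.requires_grad = True
```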
@antgr That's interesting - not seen it before. I'll try it out sometime this month and see if it improves things.
Hey, can we just re-tokenize the text using `BertTokenizer` and align the labels? Suppose our original data is in IOB format, e.g. `words1 B-label1`. If "words1" is tokenized into "word" and "s1", we can align the labels as "B-label1" and "I-label1".
Is there any problem with this kind of approach?
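Something along these lines, as a sketch (assuming IOB tags and word-level input; `tokenizer` is a `BertTokenizer`):

```python
def retokenize_with_labels(words, labels, tokenizer):
    """Split each word with the BERT tokenizer and repeat/convert its IOB label,
    so words and labels stay aligned (B-x on the first subword, I-x on the rest)."""
    new_tokens, new_labels = [], []
    for word, label in zip(words, labels):
        subwords = tokenizer.tokenize(word)
        if not subwords:  # rare edge case: tokenizer returns nothing
            continue
        new_tokens.extend(subwords)
        if label.startswith('B-'):
            inside = 'I-' + label[2:]
            new_labels.extend([label] + [inside] * (len(subwords) - 1))
        else:  # 'I-x' or 'O' just gets repeated
            new_labels.extend([label] * len(subwords))
    return new_tokens, new_labels
```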
I think that should work alright, i.e. if you have "New York" as a NOUN then you can split it into "New" and "York" and should then get NOUN and NOUN.
The real issue is that when doing inference your text should ideally be tokenized with the same tokenizer used to create the dataset. I don't think the tokenizer used to make the dataset is even publicly available, hence we have to use our own tokenizers, which cause weird misalignments.
Can't we just use the same BertTokenizer for inference? I mean, using `BertTokenizer.tokenize(sentence)`? Kind of like what you did in the tutorial:
```python
if isinstance(sentence, str):
    tokens = tokenizer.tokenize(sentence)
else:
    tokens = sentence  # of course we removed this, or: tokens = tokenizer.tokenize(' '.join(sentence))
```
Or did I misunderstand something?
Thanks
Yes, you can, and yes that is what we did.
But ideally, we should tokenize with the same tokenizer used to create the dataset - which is not available, so we have to make do.
Hey, I tried both tutorials on a sequence tagging task on my data, and with the transformer (second tutorial) it seems to be underfitting: the accuracy dropped to around 0.6 (with a simple BiLSTM-CRF I got > 0.85). Any idea what I did wrong?
Is that training, validation or testing accuracy? Your model might be overfitting.
Actually the accuracy was bad for both training and testing (I didn't do validation). For training, the accuracy got stuck at about 0.68, which is why I thought it was underfitting, while for testing the accuracy was about 0.6. I tried changing the learning rate and dropout, but it didn't change much.
Are you using the `bert-base-uncased` model? I think there are a lot of better alternatives available through the `transformers` library now, with models actually trained for NER/POS tagging.
I would try one of the models from: https://huggingface.co/models?pipeline_tag=token-classification
What does your data look like? Is it english? How long are the sentences? How many tags do you have? Are your tags balanced? Is it easy for a human to accurately tag the sentences?
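As a minimal sketch of trying one of those pretrained token-classification models (the model name below is just one example from that page, not a specific recommendation):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# example model from the token-classification page
model_name = 'dslim/bert-base-NER'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# "ner" is the token-classification pipeline
tagger = pipeline('ner', model=model, tokenizer=tokenizer)
print(tagger('George Washington went to Washington'))
```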
Hey, thanks for the response! I actually did try BERT, SciBERT, and also some of the models from your link. The results didn't change much. My data is in English, it is domain-specific, and the sentence length is standard, I guess - not particularly short or long. The tags are not balanced, and even for a human I think it is a difficult task. What makes me wonder is that I got much better accuracy with a simple BiLSTM (about 0.8) than with BERT (0.6).
Btw, it is not a NER task; it is more similar to a Semantic Role Labeling task, where the input is not only the sequence but also a one-hot vector that indicates the position of the predicate in the sentence, e.g.

```
She    | 0 | B-Arg1
likes  | 1 | Predicate
banana | 0 | B-Arg2
.      | 0 | O
```
So basically with BERT I concatenated the BERT embedding and that one-hot vector before the linear layer. Am I doing it wrong?

```python
x = torch.cat((bert_emb, pred), dim=2)
```
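For context, a minimal sketch of that setup (assuming a `transformers` `BertModel`; names and shapes are illustrative, not my exact code):

```python
import torch
import torch.nn as nn
from transformers import BertModel

class BertSRLTagger(nn.Module):
    """Sketch: concatenate BERT output with a per-token predicate-indicator flag."""
    def __init__(self, n_tags, bert_name='bert-base-uncased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        self.fc = nn.Linear(hidden + 1, n_tags)  # +1 for the 0/1 predicate flag

    def forward(self, input_ids, pred_indicator):
        # input_ids:      [batch, seq_len]
        # pred_indicator: [batch, seq_len] of 0/1 values
        bert_emb = self.bert(input_ids)[0]           # [batch, seq_len, hidden]
        pred = pred_indicator.unsqueeze(-1).float()  # [batch, seq_len, 1]
        x = torch.cat((bert_emb, pred), dim=2)       # concat along the feature dim
        return self.fc(x)                            # [batch, seq_len, n_tags]
```

A common alternative is to pass the 0/1 indicator through a small learned embedding instead of concatenating the raw flag, but that is a design choice rather than something from the tutorials.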
Although there is an example of transformers in another repo (for sentiment analysis) and it is easy to adapt it to other cases, I think sequence tagging is a bit more challenging because the BERT tokenizer splits words into subwords, which breaks the one-to-one alignment between words and labels. Would you consider adding an example with BERT for sequence tagging (e.g. POS tagging)?