flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

Tokenization MISMATCH causes Runtime Error (keyphrase tagger model) #1672

Closed whoisjones closed 4 years ago

whoisjones commented 4 years ago

While training a tagging model on the provided keyphrase dataset SEMEVAL2017, the following error comes up:

Tokenization MISMATCH in sentence '...' Last matched: 'Token: 272 ̄' Last sentence: 'Token: 325 .'

which immediately afterwards causes a RuntimeError in the forward method of sequence_tagger_model.py:

  File "xxx/PycharmProjects/flair/flair/models/sequence_tagger_model.py", line 541, in forward
    self.embeddings.embedding_length,
RuntimeError: shape '[16, 326, 3072]' is invalid for input of size 15857664

To reproduce, use the tagger tutorial code from flair. I use TransformerWordEmbeddings() with scibert or bert-base-uncased as the embedding type. The error does not come up, for example, when training the tagger with flair embeddings.
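
A minimal reproduction sketch along the lines of the standard flair tagger tutorial (the tag type 'keyword', the output path, and the hyperparameters are my assumptions, not taken from the report):

from flair.datasets import SEMEVAL2017
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# load the keyphrase corpus mentioned above
corpus = SEMEVAL2017()

# assumed tag type of the keyphrase corpus
tag_type = 'keyword'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# transformer word embeddings as described above (scibert would be analogous)
embeddings = TransformerWordEmbeddings('bert-base-uncased')

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/keyphrase-bert', max_epochs=10)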

The expected behavior is that the trainer completes training without stopping due to special characters in the sentences.

Environment:

whoisjones commented 4 years ago

A list of sentences doesn't work:

alanakbik commented 4 years ago

@whoisjones thanks for reporting this. I've just merged a PR that should fix this.

Krishnkant-Swarnkar commented 4 years ago

Just ran into a similar issue...

Tokenization MISMATCH in sentence 'i want toi hear some pop punk perfection ������ off of deezer'
Last matched: 'Token: 9 ������'
Last sentence: 'Token: 12 deezer'
subtokenized: '['[CLS]', 'i', 'want', 'to', '##i', 'hear', 'some', 'pop', 'punk', 'perfection', 'off', 'of', 'dee', '##zer', '[SEP]']'

alanakbik commented 4 years ago

Could you try with the latest master branch?

philfuchs commented 3 years ago

@alanakbik I have version flair==0.8.0.post1 installed and have the same problem with the tokenization mismatch (see also #1699). Has a fix been found? I'm using TransformerWordEmbeddings (bert-base-german-cased) as embeddings and a SequenceTagger with rnn=False and crf=False.
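
For reference, a rough sketch of the setup described here (the corpus, column format, tag type, and output path are placeholders of mine, not taken from the report):

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder column-format corpus; the actual data is not part of the report
columns = {0: 'text', 1: 'ner'}
corpus = ColumnCorpus('data/', columns)

tag_type = 'ner'
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

embeddings = TransformerWordEmbeddings('bert-base-german-cased')

# linear tagger directly on top of the transformer embeddings (no RNN, no CRF)
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type,
                        use_rnn=False,
                        use_crf=False)

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/german-bert',
              embeddings_storage_mode='gpu')

The error then shows up during training: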

2021-08-17 14:40:59,407 Embeddings storage mode: gpu
2021-08-17 14:40:59,409 ----------------------------------------------------------------------------------------------------
2021-08-17 14:41:18,334 epoch 1 - iter 34/340 - loss 0.77973957 - samples/sec: 57.49 - lr: 0.100000
2021-08-17 14:41:34,041 Tokenization MISMATCH in sentence '[REDACTED]... MS-Office Outlook Englisch Organisation         '
2021-08-17 14:41:34,041 Last matched: 'Token: 41 '
2021-08-17 14:41:34,041 Last sentence: 'Token: 49 '
2021-08-17 14:41:34,041 subtokenized: '[[REDACTED]... '##MS', '-', 'Office', 'Out', '##lo', '##ok', 'Englisch', 'Organisation']'
Traceback (most recent call last):
  File "train_transformer.py", line 56, in <module>
    trainer.train(path,
  File "/home/phillip.fuchs/anaconda3/envs/flairtest/lib/python3.8/site-packages/flair/trainers/trainer.py", line 381, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/phillip.fuchs/anaconda3/envs/flairtest/lib/python3.8/site-packages/flair/models/sequence_tagger_model.py", line 637, in forward_loss
    features = self.forward(data_points)
  File "/home/phillip.fuchs/anaconda3/envs/flairtest/lib/python3.8/site-packages/flair/models/sequence_tagger_model.py", line 668, in forward
    sentence_tensor = torch.cat(all_embs).view(
RuntimeError: shape '[32, 78, 768]' is invalid for input of size 1910016

Same error for storage_mode='cpu'.