flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

[Bug]: RuntimeError: shape '[8, 89, 1024]' is invalid for input of size 705536 #3257

Open Guust-Franssens opened 1 year ago

Guust-Franssens commented 1 year ago

Describe the bug

During NER model training with TransformerWordEmbeddings, I run into a RuntimeError for one of my three models.

For this project I train three NER models, and only one of them hits this issue. This makes me think it is a data issue rather than a code issue. Perhaps the fix belongs in the Corpus rather than in the training script.

The bug occurs at the following lines: https://github.com/flairNLP/flair/blob/b1a3e24ddec85ce62e007e1d44f8a9419215393d/flair/models/sequence_tagger_model.py#L366-L372

Printing the sentences at this stage gives:

pprint.pprint([sentence.text for sentence in sentences])
['1',
 'i',
 'I l »',
 '« I » I l',
 '[DOCSEP]',
 '\ufeff',
 'N ° d’ entreprise : Nom ( en entier ) ; ( en abrégé ) : Forme légale : '
 'Adresse complète du siège : ThaïBoxing Discovery Association Sans But '
 'Lucratif Place Terdelt , 2 boîte 11 à 1030 Schaerbeek Objet de l’ acte : '
 "Constitution d' une ASBL { Association Sans But Lucratif ) Texte Les "
 'soussignés : 1 , \t BUNRAD Anatpong , domicilié Rue de la Roche , 1 à 1301 '
 'Bîerges , né à Udon Thani ( Thaïlande ) le 14 juin 1986 ; 2 .',
 'DUBREUCQ David , domicilié Place Terdelt 2 boîte 11 à 1030 Schaerbeek , né à '
 'üccle le 14 décembre 1977 ; 3 .']

It seems there is a near-empty sentence consisting only of the character \ufeff.
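A quick stdlib check of the texts printed above confirms this. (The set of zero-width code points below is my own assumption, not something flair defines.)

```python
# Flag "sentences" that are empty once zero-width characters are stripped.
ZERO_WIDTH = {"\ufeff", "\u200b", "\u200c", "\u200d"}

def is_effectively_empty(text: str) -> bool:
    remaining = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return not remaining.strip()

texts = ["1", "i", "I l »", "« I » I l", "[DOCSEP]", "\ufeff"]
print([t for t in texts if is_effectively_empty(t)])  # ['\ufeff']
```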

Searching suggests that changing the encoding to 'utf-8-sig' could remove this character: https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
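As a small stdlib demonstration of the 'utf-8-sig' suggestion (note that it only strips a leading byte-order mark; a \ufeff in the middle of a file still has to be removed explicitly):

```python
# 'utf-8-sig' strips a leading BOM; plain 'utf-8' keeps it in the
# decoded string as '\ufeff'.
import codecs

raw = codecs.BOM_UTF8 + "George Washington".encode("utf-8")

print(repr(raw.decode("utf-8")))      # '\ufeffGeorge Washington'
print(repr(raw.decode("utf-8-sig")))  # 'George Washington'
```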

Perhaps this bug is similar to https://github.com/flairNLP/flair/issues/1600, where the offending character is also zero-width?

This is also visible in the training data: [screenshot attached]

To Reproduce

Train a model using TransformerWordEmbeddings and the sentences above.

Expected behavior

Training proceeds as normal.

Logs and Stack traces

2023-05-31 04:28:38 : │ ❱ 62 │   │   trainer.fine_tune(                                              │
2023-05-31 04:28:38 : │   63 │   │   │   output_folder,                                              │
2023-05-31 04:28:38 : │   64 │   │   │   learning_rate=training_args.learning_rate,                  │
2023-05-31 04:28:38 : │   65 │   │   │   mini_batch_size=training_args.mini_batch_size,              │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/trainers/trainer.py:899 in fine_tune                                   │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │    896 │   │   **trainer_args,                                               │
2023-05-31 04:28:38 : │    897 │   ):                                                                │
2023-05-31 04:28:38 : │    898 │   │                                                                 │
2023-05-31 04:28:38 : │ ❱  899 │   │   return self.train(                                            │
2023-05-31 04:28:38 : │    900 │   │   │   base_path=base_path,                                      │
2023-05-31 04:28:38 : │    901 │   │   │   learning_rate=learning_rate,                              │
2023-05-31 04:28:38 : │    902 │   │   │   max_epochs=max_epochs,                                    │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/trainers/trainer.py:500 in train                                       │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │    497 │   │   │   │   │   for batch_step in batch_steps:                    │
2023-05-31 04:28:38 : │    498 │   │   │   │   │   │                                                 │
2023-05-31 04:28:38 : │    499 │   │   │   │   │   │   # forward pass                                │
2023-05-31 04:28:38 : │ ❱  500 │   │   │   │   │   │   loss = self.model.forward_loss(batch_step)    │
2023-05-31 04:28:38 : │    501 │   │   │   │   │   │                                                 │
2023-05-31 04:28:38 : │    502 │   │   │   │   │   │   if isinstance(loss, tuple):                   │
2023-05-31 04:28:38 : │    503 │   │   │   │   │   │   │   average_over += loss[1]                   │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/models/sequence_tagger_model.py:270 in forward_loss                    │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │    267 │   │   │   return torch.tensor(0.0, dtype=torch.float, device=flair. │
2023-05-31 04:28:38 : │    268 │   │                                                                 │
2023-05-31 04:28:38 : │    269 │   │   # forward pass to get scores                                  │
2023-05-31 04:28:38 : │ ❱  270 │   │   scores, gold_labels = self.forward(sentences)  # type: ignore │
2023-05-31 04:28:38 : │    271 │   │                                                                 │
2023-05-31 04:28:38 : │    272 │   │   # calculate loss given scores and labels                      │
2023-05-31 04:28:38 : │    273 │   │   return self._calculate_loss(scores, gold_labels)              │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/models/sequence_tagger_model.py:285 in forward                         │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │    282 │   │   self.embeddings.embed(sentences)                              │
2023-05-31 04:28:38 : │    283 │   │                                                                 │
2023-05-31 04:28:38 : │    284 │   │   # make a zero-padded tensor for the whole sentence            │
2023-05-31 04:28:38 : │ ❱  285 │   │   lengths, sentence_tensor = self._make_padded_tensor_for_batch │
2023-05-31 04:28:38 : │    286 │   │                                                                 │
2023-05-31 04:28:38 : │    287 │   │   # sort tensor in decreasing order based on lengths of sentenc │
2023-05-31 04:28:38 : │    288 │   │   sorted_lengths, length_indices = lengths.sort(dim=0, descendi │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │ /opt/conda/envs/Fortis-Named-Entity-Recognition/lib/python3.9/site-packages/ │
2023-05-31 04:28:38 : │ flair/models/sequence_tagger_model.py:366 in _make_padded_tensor_for_batch   │
2023-05-31 04:28:38 : │                                                                              │
2023-05-31 04:28:38 : │    363 │   │   │   │   t = pre_allocated_zero_tensor[: self.embeddings.embed │
2023-05-31 04:28:38 : │    364 │   │   │   │   all_embs.append(t)                                    │
2023-05-31 04:28:38 : │    365 │   │                                                                 │
2023-05-31 04:28:38 : │ ❱  366 │   │   sentence_tensor = torch.cat(all_embs).view(                   │
2023-05-31 04:28:38 : │    367 │   │   │   [                                                         │
2023-05-31 04:28:38 : │    368 │   │   │   │   len(sentences),                                       │
2023-05-31 04:28:38 : │    369 │   │   │   │   longest_token_sequence_in_batch,                      │
2023-05-31 04:28:38 : ╰────────────────────────────────────────────────────────────────

Screenshots

No response

Additional Context

No response

Environment

Flair version: 0.11.3
Torch version: 1.13.1
Transformers version: 4.29.2

Guust-Franssens commented 1 year ago

@alanakbik I managed to resolve it by getting rid of this invisible \ufeff character.

Could this token also be removed automatically when it appears inside a sentence?

e.g.

George B-PER
Washington E-PER
\ufeff O
went O
to O
Washington S-LOC

becomes:

George B-PER
Washington E-PER
went O
to O
Washington S-LOC

I do not know where in the repo this should be changed; otherwise I would open a PR.
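The token-level cleanup described above could be sketched like this (a minimal, hypothetical helper; `ZERO_WIDTH` and `clean_conll` are my own names, not flair API):

```python
# Hypothetical sketch: drop rows from a CoNLL-style column file whose
# surface token consists only of zero-width characters such as '\ufeff'.
ZERO_WIDTH = {"\ufeff", "\u200b", "\u200c", "\u200d"}

def clean_conll(lines):
    cleaned = []
    for line in lines:
        columns = line.split()
        # skip rows whose token is purely zero-width; keep blank
        # separator lines and rows with a visible token
        if columns and all(ch in ZERO_WIDTH for ch in columns[0]):
            continue
        cleaned.append(line)
    return cleaned

sample = [
    "George B-PER",
    "Washington E-PER",
    "\ufeff O",
    "went O",
    "to O",
    "Washington S-LOC",
]
print(clean_conll(sample))
```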

helpmefindaname commented 1 year ago

Hi @Guust-Franssens, since you are not on the latest version, could you try this again on the master branch? There have already been some improvements for similar issues, and I think yours may be solved already.