Open Guust-Franssens opened 1 year ago
@alanakbik I managed to resolve it by getting rid of this invisible \ufeff character.
Could this token also be removed automatically when it appears inside a sentence?
e.g.
George B-PER
Washington E-PER
\ufeff O
went O
to O
Washington S-LOC
becomes:
George B-PER
Washington E-PER
went O
to O
Washington S-LOC
I do not know where to modify this in the repo; otherwise I would open a PR.
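The removal described above could be sketched as a small pre-processing step over the CoNLL-style lines (the function name and the set of zero-width characters are my assumptions, not Flair API):

```python
# Zero-width characters that render as "invisible" tokens; \ufeff is the BOM.
ZERO_WIDTH = {"\ufeff", "\u200b", "\u200c", "\u200d"}

def filter_zero_width_tokens(lines):
    """Drop CoNLL lines whose token consists only of zero-width characters."""
    cleaned = []
    for line in lines:
        if not line.strip():
            cleaned.append(line)  # keep blank lines that separate sentences
            continue
        token = line.split()[0]
        if all(ch in ZERO_WIDTH for ch in token):
            continue  # token is invisible -> skip the whole line
        cleaned.append(line)
    return cleaned

conll = ["George B-PER", "Washington E-PER", "\ufeff O",
         "went O", "to O", "Washington S-LOC"]
for line in filter_zero_width_tokens(conll):
    print(line)
```

This turns the first example above into the second one, leaving the tag sequence of the remaining tokens untouched.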
Hi @Guust-Franssens, since you are not on the latest version, could you try this again on the master branch? There have already been improvements for similar issues, so yours may be solved already.
Describe the bug
During NER model training with TransformerWordEmbeddings, I run into a RuntimeError for one of my three models.
For this project I train three NER models, and only one of them runs into the issue. This makes me think it is a data issue rather than a code issue, so perhaps a fix is needed in the Corpus rather than in the training script.
Bug occurs at the following lines: https://github.com/flairNLP/flair/blob/b1a3e24ddec85ce62e007e1d44f8a9419215393d/flair/models/sequence_tagger_model.py#L366-L372
printing the sentences at this stage gives:
It seems there is a sentence that is empty apart from the character \ufeff.
Some searching suggests that changing the encoding to 'utf-8-sig' would remove this character: https://stackoverflow.com/questions/17912307/u-ufeff-in-python-string
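A minimal demonstration of that Stack Overflow suggestion (the file name is just an example): reading with 'utf-8-sig' strips a leading BOM that plain 'utf-8' keeps as \ufeff. Note that 'utf-8-sig' only removes a BOM at the very start of the file, so a \ufeff in the middle of the data would still need separate filtering:

```python
# Write a file whose content starts with a BOM.
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\ufeffGeorge B-PER\n")

# Plain utf-8 keeps the BOM as an invisible first character.
with open("train.txt", encoding="utf-8") as f:
    print(repr(f.read()[0]))  # '\ufeff'

# utf-8-sig strips the leading BOM on read.
with open("train.txt", encoding="utf-8-sig") as f:
    print(repr(f.read()[0]))  # 'G'
```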
Perhaps this bug is similar to https://github.com/flairNLP/flair/issues/1600, where the offending character is zero-width?
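A quick way to check for such invisible characters is to print the code points of a suspicious token; this diagnostic snippet is my own, not Flair code:

```python
import unicodedata

token = "\ufeff"  # looks empty when printed
print([hex(ord(ch)) for ch in token])   # ['0xfeff']
print(unicodedata.name(token))          # ZERO WIDTH NO-BREAK SPACE
```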
The same character is visible when inspecting the training data:
To Reproduce
Expected behavior
Training proceeds as normal.
Logs and Stack traces
Environment
Flair version: 0.11.3
Torch version: 1.13.1
Transformers version: 4.29.2