flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

'UnicodeDecodeError' when using functions defined in data_fetcher.py #67

Closed iamyihwa closed 6 years ago

iamyihwa commented 6 years ago

Hello

I was following the steps given in the documentation here.

Situation: I am trying to train a Spanish NER model. The data I am using for training is in a CoNLL-like format. The error happened when I was testing a few options to load the input files. I installed flair by cloning the repository and installing from the master branch. I first wanted to install via pip, but I want to train my own embeddings, and according to this that seems to be a feature available only through the git clone, not the pip install.

When the error occurred: When I ran the function to load the file, data = NLPTaskDataFetcher.read_conll_sequence_labeling_data('./data/esp2.train'), I got an error that said:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 291: ordinal not in range(128)

Screenshot of the error is attached.

What I tried: (1) I tried the different functions implemented in data_fetcher.py. (2) I also passed the encoding explicitly, open(path_to_conll_file, encoding='utf-8'), but that didn't solve the issue.

I have also tried a couple of options to change the system encoding, but without much success.
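
For reference, a minimal diagnostic sketch to pinpoint the offending byte (the path is the training file from above; this assumes the file itself is valid UTF-8):

# Read the file as raw bytes and locate the byte that the ASCII codec rejects.
path = './data/esp2.train'
raw = open(path, 'rb').read()

try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)                                # same message as above, with the position
    print(raw[e.start - 20:e.start + 20])   # surrounding bytes for context

# Decoding explicitly as UTF-8 should succeed if the file is UTF-8 encoded.
text = raw.decode('utf-8')
print(text[:200])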

alanakbik commented 6 years ago

Hello Yihwa,

Thanks for your interest! The error looks like some strange encoding problem. Could you share the data (or post a few sentences from it here) so we can try to reproduce the error?

iamyihwa commented 6 years ago

Sure!

It is attached: Spanish words with NER labels (from CoNLL 2002). esp.txt

alanakbik commented 6 years ago

Hello Yihwa,

I cannot reproduce the error. Could you do a git pull to get the latest code in the master branch (we just refactored the NLPTaskDataFetcher) and do the following:

from flair.data_fetcher import NLPTaskDataFetcher

sentences = NLPTaskDataFetcher.read_column_data('/path/to/esp.txt', column_name_map={0: 'text', 3: 'ner'})

for sentence in sentences:
    print(sentence.to_tagged_string())

Does this also throw the error? What operating system do you use?

iamyihwa commented 6 years ago

I did! I get the same error. I am using Ubuntu.

(screenshots of the code and the error attached)

stefan-it commented 6 years ago

What is the output of locale on your command line?

stefan-it commented 6 years ago

I could reproduce the error with Anaconda:

$ export LANG=C
$ python esp.py
Traceback (most recent call last):
  File "esp.py", line 3, in <module>
    sentences = NLPTaskDataFetcher.read_column_data('/home/stefan/Downloads/esp.txt', column_name_map={0: 'text', 3: 'ner'})
  File "/home/stefan/Repositories/github.com/flair/flair/data_fetcher.py", line 229, in read_column_data
    lines: List[str] = open(path_to_column_file).read().strip().split('\n')
  File "/tmp/anaconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 291: ordinal not in range(128)

So you have to make sure that you're using a UTF-8 capable locale, like C.UTF-8 or es_ES.UTF-8 :)
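
For context, when no encoding argument is given, Python 3's open() falls back to the locale's preferred encoding, which you can check from within the interpreter; a minimal sketch (the commented output is roughly what you would see under LANG=C):

import locale

# open() uses this value whenever no encoding= argument is passed.
print(locale.getpreferredencoding(False))   # e.g. 'ANSI_X3.4-1968' (ASCII) under LANG=C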

iamyihwa commented 6 years ago

The system does have a UTF-8 locale (screenshot attached).

However, Anaconda does not seem to use it (screenshot attached).

I made a few attempts to change the system encoding (screenshot attached).

However, even after that I get the same error (screenshot attached).

I tried on a Mac, and the error does not appear there.

iamyihwa commented 6 years ago

Adding encoding='utf-8' to the open() call in data_fetcher.py and reloading the Jupyter notebook kernel solved the problem.

This morning it didn't work for some reason. I wonder if it was due to an update, a change in my environment, or not reloading the Jupyter notebook kernel, but the issue is solved now! Thanks! :-)
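
For reference, based on the open() call shown in the traceback above, the change in data_fetcher.py (read_column_data) amounts to roughly this, with the path as a placeholder:

from typing import List

path_to_column_file = '/path/to/esp.txt'   # placeholder path
# Passing the encoding explicitly avoids falling back to the locale default.
lines: List[str] = open(path_to_column_file, encoding='utf-8').read().strip().split('\n')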

manibt1992 commented 1 year ago

I faced a "'utf-8' codec can't decode byte 0xa0" error when initiating flair retraining.

Then I opened the input file in Notepad and did a Save As with the "UTF-8" option. When I reran the script, it worked for me.
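
For anyone who wants to do the same conversion in code instead of Notepad, a minimal sketch; it assumes the source file is cp1252 (where 0xa0 is a non-breaking space), and the file names are placeholders:

# Re-save a file as UTF-8 (the programmatic equivalent of Notepad's Save As -> UTF-8).
src_encoding = 'cp1252'   # assumption: adjust to the file's actual encoding
with open('train.txt', encoding=src_encoding) as f:        # placeholder input file
    text = f.read()
with open('train_utf8.txt', 'w', encoding='utf-8') as f:   # placeholder output file
    f.write(text)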