Closed iamyihwa closed 6 years ago
Hello Yihwa,
Thanks for your interest! The error looks like some strange encoding problem. Could you share the data (or post a few sentences from it here) so we can try to reproduce the error?
Sure!
It is attached. It is Spanish words with NER labels (from CoNLL 2002): esp.txt
Hello Yihwa,
I cannot reproduce the error. Could you do a git pull to get the latest code in the master branch (we just refactored the NLPTaskDataFetcher) and do the following:
from flair.data_fetcher import NLPTaskDataFetcher
sentences = NLPTaskDataFetcher.read_column_data('/path/to/esp.txt', column_name_map={0: 'text', 3: 'ner'})
for sentence in sentences:
    print(sentence.to_tagged_string())
Does this also throw the errors? What operating system do you use?
I did! I get the same error. I am using Ubuntu.
What is the output of locale on your command line?
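A quick way to check this from Python itself is to ask for the locale's preferred encoding, which is what open() falls back to when no encoding argument is given (a minimal sketch; the exact encoding name you see depends on your system):

```python
import locale

# open() without an explicit encoding uses the locale's preferred
# encoding. If this prints 'ANSI_X3.4-1968' (i.e. ASCII) instead of
# 'UTF-8', non-ASCII bytes in the data file will fail to decode.
print(locale.getpreferredencoding(False))
```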
I could reproduce the error with Anaconda:
$ export LANG=C
$ python esp.py
Traceback (most recent call last):
File "esp.py", line 3, in <module>
sentences = NLPTaskDataFetcher.read_column_data('/home/stefan/Downloads/esp.txt', column_name_map={0: 'text', 3: 'ner'})
File "/home/stefan/Repositories/github.com/flair/flair/data_fetcher.py", line 229, in read_column_data
lines: List[str] = open(path_to_column_file).read().strip().split('\n')
File "/tmp/anaconda3/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 291: ordinal not in range(128)
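The failure can be reproduced independently of flair: opening a UTF-8 file while decoding as ASCII raises the same error. A minimal sketch using a throwaway file (the sample content is made up):

```python
import tempfile

# 'España' encoded as UTF-8 contains the byte 0xc3,
# which the ASCII codec cannot decode.
with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as f:
    f.write('España B-LOC\n'.encode('utf-8'))
    path = f.name

err = None
try:
    # Simulates what happens under LANG=C, where open() decodes as ASCII.
    open(path, encoding='ascii').read()
except UnicodeDecodeError as exc:
    err = exc

print(err)  # 'ascii' codec can't decode byte 0xc3 ...
```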
So you have to make sure that you're using a UTF-8 capable locale, like C.UTF-8 or es_ES.UTF-8 :)
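Alternatively, passing an explicit encoding to open() makes the read independent of the locale entirely. A sketch with a throwaway file (the sample line is illustrative):

```python
import tempfile

with tempfile.NamedTemporaryFile(suffix='.txt', delete=False) as f:
    f.write('España B-LOC\n'.encode('utf-8'))
    path = f.name

# An explicit encoding overrides the locale default,
# so this works even under LANG=C.
lines = open(path, encoding='utf-8').read().strip().split('\n')
print(lines[0])  # España B-LOC
```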
The system does have UTF-8 encoding. However, Anaconda does not seem to use it.
I made a few attempts to change the system encoding, but even after that I get the same error.
I tried on the Mac, and the error doesn't appear there...
Adding encoding='utf-8' to the open() call in data_fetcher.py and reloading the Jupyter notebook kernel solved the problem.
This morning it didn't work for some reason. I wonder if it was due to an update, a change in my environment, or not reloading the Jupyter notebook kernel, but the issue is solved now! Thanks! :-)
I faced a "'utf-8' codec can't decode byte 0xa0" error when initiating the flair retraining.
Then I opened the input file in Notepad and saved it with the "UTF-8" option. When I reran the script, it worked for me.
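The Notepad "save as UTF-8" step can also be done programmatically. A sketch that assumes the original file is Latin-1, where 0xa0 is a non-breaking space; the source encoding is a guess and may differ for your file:

```python
import tempfile

# Create a file containing the problematic 0xa0 byte (Latin-1 NBSP).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'Espa\xf1a\xa0B-LOC\n')
    src = f.name

# Re-encode: decode as Latin-1 (assumed source encoding), write as UTF-8.
text = open(src, encoding='latin-1').read()
dst = src + '.utf8'
open(dst, 'w', encoding='utf-8').write(text)

# The rewritten file now decodes cleanly as UTF-8.
print(open(dst, encoding='utf-8').read())
```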
Hello
I was following the steps given in the documentation here.
Situation: I am trying to train a Spanish NER model. The data I am using to train is in a CoNLL-like format. The error happened while I was testing a few options to load the input files. I installed flair through git clone from the master branch. I first wanted to install via pip install, but I want to train my own embeddings, and according to this that seems to be a feature available only through the git clone, not through pip install.
When the error occurred: When I ran the function to load the file
data = NLPTaskDataFetcher.read_conll_sequence_labeling_data('./data/esp2.train')
I got an error that said: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 291: ordinal not in range(128)
Screenshot of the error
What I tried: I have tried (1) different functions implemented in data_fetcher, and (2) passing the encoding option as
open(path_to_conll_file, encoding='utf-8')
, but that didn't solve the issue. I have also tried a couple of options to change the system encoding, but without success.