ERijck closed this issue 2 years ago
@ERijck I hit a similar error when using the preprocessor.preprocess_dataset() function
that comes with OCTIS. It turned out that my dataset contained some Unicode characters and emojis.
The error you are seeing comes from line 77 in downloader.py,
where OCTIS tries to create the dataset files. Link here: https://github.com/MIND-Lab/OCTIS/blob/1e8e6be5040b38cf3c458ece4327886dee8568ef/octis/dataset/downloader.py#L75
A quick fix would be to read/write the corpus with an encoding
that allows Unicode characters, like the following:
with open(corpus_path, 'w', encoding='utf8') as f:
    f.write(corpus.text)
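To illustrate why the explicit encoding matters, here is a minimal standalone sketch (not OCTIS code). The helper write_corpus, the sample text, and the file name are made up for the example, and 'ascii' merely stands in for a restrictive platform default encoding, which is an assumption about the failing setup:

```python
import os
import tempfile

# Sample text with non-ASCII characters (made up for illustration).
text = "caffè ☕ 😀"


def write_corpus(path, text, encoding):
    """Hypothetical helper: write text to path using the given encoding."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, "corpus.txt")

# 'ascii' stands in for a narrow default encoding (assumption);
# it cannot represent the accented letter or the emojis.
try:
    write_corpus(corpus_path, text, "ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# With an explicit UTF-8 encoding the same write succeeds.
write_corpus(corpus_path, text, "utf8")
```

When no encoding is passed, open() falls back to a locale-dependent default, so the same script can work on one machine and fail on another.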
Two quick modifications:
NOTE: If you end up making this modification yourself before OCTIS fixes it, you also need to change the functions that read the "Dataset" files, so that they use the same encoding, before you start training the model.
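As a rough sketch of the read side mentioned in the note: the files should be read back with the same encoding they were written with. The helpers save_documents and load_documents are hypothetical, and the one-document-per-line layout is an assumption that may not match the files OCTIS actually produces:

```python
import os
import tempfile


def save_documents(path, documents):
    """Hypothetical writer: one document per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf8") as f:
        for doc in documents:
            f.write(doc + "\n")


def load_documents(path):
    """Hypothetical reader: must use the same encoding as the writer."""
    with open(path, "r", encoding="utf8") as f:
        return [line.rstrip("\n") for line in f]


# Sample documents (made up for illustration).
documents = ["città di Milano", "emoji test 😀"]

path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
save_documents(path, documents)
loaded = load_documents(path)
```

If only the writer is patched, a reader that still relies on the platform default may raise UnicodeDecodeError or silently return garbled text.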
Hi, thanks @ERijck for reporting and @Ravi2712 for your suggestion. It is indeed a problem of encoding. I'll fix this in the next release.
Silvia
Description
I am trying to fetch the DBPedia_IT dataset. I expected nothing to happen, but a UnicodeEncodeError was raised.
What I Did