UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset

ERijck commented 2 years ago

OCTIS version: 1.10.3
Python version: 3.9
Operating System: Windows

Description

I am trying to fetch the DBPedia_IT dataset. I expected the call to complete without errors, but a UnicodeEncodeError was raised.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')

Traceback (most recent call last):

  Input In [42] in <module>
    dataset.fetch_dataset('DBPedia_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
    f.write(corpus.text)

  File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>

Ravi2712 commented 2 years ago

@ERijck I ran into a similar error when using preprocessor.preprocess_dataset(), a function that ships with OCTIS. It turned out that my dataset contained some Unicode characters and emojis.

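The error you are seeing comes from line 77 of downloader.py, where OCTIS writes the downloaded dataset files to disk. The file appears to be opened without an explicit encoding, so on Windows Python falls back to cp1252 (the codec shown in your traceback), which cannot represent every character in the corpus. Link here: https://github.com/MIND-Lab/OCTIS/blob/1e8e6be5040b38cf3c458ece4327886dee8568ef/octis/dataset/downloader.py#L75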

A quick fix would be to open the corpus file with an encoding that supports Unicode characters, like this:

with open(corpus_path, 'w', encoding='utf8') as f:
    f.write(corpus.text)
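To see why the explicit encoding matters, here is a minimal sketch that does not touch OCTIS at all. The sample string and the can_encode helper are made up for illustration; the only assumption is that the corpus contains characters outside cp1252, which is what the traceback suggests.

```python
# Illustrative sample: the arrow and the emoji have no cp1252 mapping.
text = "caff\u00e8 \u2192 \U0001F600"


def can_encode(s, encoding):
    """Return True if every character of s can be encoded with `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


cp1252_ok = can_encode(text, "cp1252")  # the codec named in the traceback
utf8_ok = can_encode(text, "utf-8")     # UTF-8 covers all of Unicode

print("cp1252:", cp1252_ok)
print("utf-8:", utf8_ok)
```

Writing to a file opened in text mode goes through the same encoding step, which is why passing encoding='utf8' to open() avoids the error.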

Two ways to apply this change:

You can fork the repository and change these lines until OCTIS provides Unicode support for datasets.
You can edit the files installed in your environment directly (not recommended).

NOTE: If you end up making this modification yourself before OCTIS does, you also need to change the functions that read the "Dataset" files before you start training a model, so that they use the same encoding.
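As a rough sketch of what "the same encoding on both sides" means, the snippet below writes and reads a file with UTF-8 explicitly. The file name corpus.txt and the sample text are hypothetical, and this is not the actual OCTIS reading code, only the pattern you would apply wherever it opens the dataset files.

```python
import os
import tempfile

# Hypothetical sample text with characters that cp1252 cannot encode.
sample = "Citt\u00e0 di Torino \u2192 \U0001F600\n"

with tempfile.TemporaryDirectory() as tmp:
    corpus_path = os.path.join(tmp, "corpus.txt")  # illustrative file name

    # Write with an explicit encoding instead of the platform default.
    with open(corpus_path, "w", encoding="utf-8") as f:
        f.write(sample)

    # Read back with the same encoding; a mismatch here would either
    # raise UnicodeDecodeError or silently produce garbled text.
    with open(corpus_path, "r", encoding="utf-8") as f:
        restored = f.read()

print(restored == sample)
```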

silviatti commented 2 years ago

Hi, thanks @ERijck for reporting and @Ravi2712 for your suggestion. It is indeed a problem of encoding. I'll fix this in the next release.

Silvia

MIND-Lab / OCTIS

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset #57
