MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined> - when fetching the DBPedia_IT dataset #57

Closed ERijck closed 2 years ago

ERijck commented 2 years ago

Description

I am trying to fetch the DBPedia_IT dataset. I expected the fetch to complete without any errors, but a UnicodeEncodeError was raised.

What I Did

from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.fetch_dataset('DBPedia_IT')

Traceback (most recent call last):

  Input In [42] in <module>
    dataset.fetch_dataset('DBPedia_IT')

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\dataset.py:392 in fetch_dataset
    cache = download_dataset(dataset_name, target_dir=dataset_home, cache_path=cache_path)

  File ~\Anaconda3\envs\SentenceTransformers\lib\site-packages\octis\dataset\downloader.py:77 in download_dataset
    f.write(corpus.text)

  File ~\Anaconda3\envs\SentenceTransformers\lib\encodings\cp1252.py:19 in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]

UnicodeEncodeError: 'charmap' codec can't encode characters in position 31758-31761: character maps to <undefined>

Ravi2712 commented 2 years ago

@ERijck I faced a similar error while using the preprocessor.preprocess_dataset() function that comes with OCTIS. It turned out that my dataset contained some Unicode characters and emojis.

The error you are facing comes from line 77 of downloader.py, where OCTIS writes the downloaded dataset files to disk. Link here: https://github.com/MIND-Lab/OCTIS/blob/1e8e6be5040b38cf3c458ece4327886dee8568ef/octis/dataset/downloader.py#L75
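For context, the failure can be reproduced without OCTIS. On Windows, open() in text mode without an explicit encoding usually falls back to the locale's code page (cp1252 in the traceback above), which cannot represent many characters. A minimal sketch, where the sample string and the can_encode helper are made up for illustration and are not part of OCTIS:

```python
# Hypothetical sample text; not taken from the DBPedia_IT dataset.
sample = "perch\u00e9 \U0001F600 \u4e2d"


def can_encode(text, encoding):
    """Return True if every character in text can be encoded with encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# cp1252 can encode the accented letter but not the emoji or the CJK character.
cp1252_ok = can_encode(sample, "cp1252")
# UTF-8 can encode any Unicode string like this one.
utf8_ok = can_encode(sample, "utf-8")
```

If cp1252_ok is False while utf8_ok is True, the text itself is fine and only the file encoding needs to change.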

A quick fix would be to open the corpus file with an encoding that supports Unicode characters, like the following:

with open(corpus_path, 'w', encoding='utf8') as f:
    f.write(corpus.text)

Two quick modifications:

NOTE: If you end up making this modification yourself before OCTIS ships a fix, you also need to change the functions that read the "Dataset" files so that they use the same encoding, before you start training a model.
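To illustrate why the read side has to match, here is a minimal round-trip sketch. It does not touch OCTIS itself: the text value stands in for corpus.text, and the temporary corpus_path is a placeholder I made up for the example.

```python
import os
import tempfile

# Hypothetical stand-in for corpus.text; includes an accented letter and an emoji.
text = "citt\u00e0 \U0001F600"

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file location; OCTIS chooses its own paths.
    corpus_path = os.path.join(tmp_dir, "corpus.txt")

    # Write with an explicit encoding instead of the platform default.
    with open(corpus_path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same encoding so the characters are decoded correctly.
    with open(corpus_path, "r", encoding="utf-8") as f:
        round_trip = f.read()
```

Passing the same encoding to both open() calls keeps the behaviour identical on Windows, Linux and macOS, regardless of the default code page.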

silviatti commented 2 years ago

Hi, thanks @ERijck for reporting and @Ravi2712 for your suggestion. It is indeed an encoding problem. I'll fix this in the next release.

Silvia