MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

AttributeError: 'list' object has no attribute 'lower' preprocessor.preprocess_dataset when num_processes != None #99

Open p-dre opened 1 year ago

p-dre commented 1 year ago

OCTIS version: 1.11.0
Python version: 3.8.15
Operating System: 'posix'

Description - What I Did

I read in my own data and save it as a .txt file with one document per line. Then I define the preprocessing and run it via preprocessor.preprocess_dataset. This raises AttributeError: 'list' object has no attribute 'lower'. If I do not set num_processes, everything works.

The loop in simple_preprocessing_steps, combined with process_map, splits the documents into individual characters. See below.


import os
import string
from octis.preprocessing.preprocessing import Preprocessing
import pandas as pd

docs = pd.read_csv('tweets.csv',lineterminator='\n')
docs['clean_tweets'].to_csv('documents.txt', header=None,  sep='\n', mode='w', encoding="utf-8")

preprocessor = Preprocessing(
    max_features=None,
    remove_punctuation=True, punctuation=string.punctuation,
    lemmatize=True, stopword_list='german',
    min_chars=1, min_words_docs=0, language='german', split=False,
    num_processes=36, max_df=0.9, min_df=0.05)
# preprocess
dataset = preprocessor.preprocess_dataset(documents_path='documents.txt')

Traceback (most recent call last):
  File "/home/p/p_drec01/lda/preprocess_lda_test.py", line 40, in <module>
    dataset = preprocessor.preprocess_dataset(documents_path='/scratch/tmp/p_drec01/lda/octis_data/documents.txt')
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 171, in preprocess_dataset
    vocabulary = self.filter_words(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/octis/preprocessing/preprocessing.py", line 290, in filter_words
    vectorizer.fit_transform(docs)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1846, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1202, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1114, in _count_vocab
    for feature in analyze(doc):
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 104, in _analyze
    doc = preprocessor(doc)
  File "/home/p/p_drec01/miniconda3/envs/octis_env/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 69, in _preprocess
    doc = doc.lower()
AttributeError: 'list' object has no attribute 'lower'

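# Minimal reproduction of the character splitting
from tqdm.contrib.concurrent import process_map

documents_path = 'documents.txt'
with open(documents_path, 'r', encoding='utf-8') as f:
    docs2 = [line.strip() for line in f]

def simple_preprocessing_steps(docs):
    # process_map calls this once per element of docs2, so `docs` is a
    # single string here and the loop iterates over its characters.
    for d in docs:
        print(d)

docs2 = process_map(simple_preprocessing_steps, docs2, max_workers=16, chunksize=1)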

Ü
b
e
r

6
"

U
M
n

etc.
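
The splitting above can be reproduced without OCTIS or tqdm. This is a minimal sketch, assuming only that process_map behaves like list(map(fn, iterable)) run on a process pool, which is how the tqdm documentation describes it. The function body is a hypothetical stand-in for a step written to take a whole list of documents; it is not the OCTIS implementation.

```python
def simple_preprocessing_steps(docs):
    """Hypothetical stand-in: expects a list of documents, lowercases each."""
    return [d.lower() for d in docs]


docs = ["Dog", "Cat"]

# Intended call: the whole list is passed once, each item is a document.
whole = simple_preprocessing_steps(docs)

# Map-style call: the function runs once per element, so `docs` inside the
# function is a single string and the loop walks over its characters.
per_item = list(map(simple_preprocessing_steps, docs))

print(whole)
print(per_item)
```

The second result is a list of lists of characters, which is the kind of input that makes a vectorizer call .lower() on a list instead of a string.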

Edilson-R commented 1 year ago

I have the same problem when loading a custom dataset.

Python 3.10.11 OCTIS 1.12.1 System: Windows 10

Code:

import os
import string
import spacy
from octis.preprocessing.preprocessing import Preprocessing

preprocessor = Preprocessing(
    lowercase=True, vocabulary=None, max_features=None,
    remove_punctuation=True, punctuation=string.punctuation,
    lemmatize=True, language='portuguese', remove_numbers=True,
    min_chars=4, remove_stopwords_spacy=True,
    min_df=0.1, max_df=0.8, num_processes=7)

AttributeError: 'list' object has no attribute 'lower'

vinnyricciardi commented 1 year ago

I'm getting the same issue. It only seems to occur when Preprocessing is used with num_processes set to something other than None, or with split=True. It seems these code paths turn a list of strings (e.g., ['dog', 'cat']) into a list of lists of characters (e.g., [['d', 'o', 'g'], ['c', 'a', 't']]).
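
One possible way around the nesting, sketched under assumptions rather than taken from OCTIS: if the per-worker function expects a list of documents, hand it lists. The helper chunked and the stand-in simple_preprocessing_steps below are hypothetical names for illustration. The built-in map is used so the sketch runs anywhere; the same chunks could in principle be passed to process_map instead.

```python
from itertools import chain


def chunked(items, size):
    """Hypothetical helper: split a list into consecutive sublists of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def simple_preprocessing_steps(docs):
    """Hypothetical stand-in: expects a list of documents, lowercases each."""
    return [d.lower() for d in docs]


docs = ["Dog", "Cat", "Bird"]

# Each mapped element is now a list of documents, not a single string.
chunks = chunked(docs, 2)
results = map(simple_preprocessing_steps, chunks)

# Flatten the per-chunk results back into one list of strings.
flat = list(chain.from_iterable(results))
print(flat)
```

The flattened result is again a flat list of strings, which is what a downstream vectorizer expects. Until a fix lands in the library itself, the reports above suggest that leaving num_processes unset also avoids the error.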