bab2min / tomotopy

Python package of Tomoto, the Topic Modeling Tool
https://bab2min.github.io/tomotopy
MIT License

"RuntimeError: Either `words` or `rawWords` must be filled" using `add_doc` sometimes #161

Closed: batmanscode closed this issue 1 year ago

batmanscode commented 2 years ago

I have text in a dataframe and was adding it in like this:

for text in df['text']:
    mdl.add_doc(text.strip().split())

This works fine.

However, when I remove stopwords before calling add_doc, I get the error in the title.

I'm doing the preprocessing using texthero like this:

import texthero as hero
from texthero import preprocessing

custom_pipeline = [preprocessing.remove_stopwords,
                   preprocessing.remove_digits,
                   preprocessing.remove_punctuation,
                   preprocessing.remove_whitespace]

df['clean_text'] = hero.clean(df['tweet'], custom_pipeline)

for text in df['clean_text']:
    mdl.add_doc(text.strip().split())

RuntimeError: Either `words` or `rawWords` must be filled.

Side note: maybe this kind of preprocessing could be built into tomotopy using texthero.

bab2min commented 2 years ago

Hi @batmanscode, it seems that there is an empty document in your df['clean_text']. Could you check the values of df['clean_text'] to make sure there are no blank documents?
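
To illustrate how a document can end up empty, here is a minimal sketch in plain Python. The stopword set and the `tokenize` helper are hypothetical stand-ins for the texthero pipeline, not its actual behavior; the point is only that a text made entirely of stopwords leaves an empty token list, which is presumably what add_doc rejects.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "and", "of", "is", "it"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


# A normal text keeps some tokens after cleaning.
tokens_ok = tokenize("Topic modeling is fun")

# A text made only of stopwords leaves nothing behind.
tokens_empty = tokenize("It is the")

# An already-empty cleaned string also splits into an empty list.
tokens_blank = "".strip().split()

print(tokens_ok, tokens_empty, tokens_blank)
```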

batmanscode commented 2 years ago

@bab2min df['clean_text'].isnull().value_counts() showed no empty values

bab2min commented 2 years ago

@batmanscode df.isnull() only tests whether a value is NA. Because an empty str '' is not NA, it doesn't flag empty strings. Try the following:

df['clean_text'].apply(lambda x:bool(x.strip())).value_counts()
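
The difference between the two checks can be seen on a small example. This is a sketch on made-up data (the Series below is an assumption, not the original dataframe): isnull() counts no rows, while a strip-based check finds the blank ones and lets you filter them out.

```python
import pandas as pd

# Hypothetical cleaned column: one row is empty, one is whitespace only.
clean_text = pd.Series(["topic models rock", "", "   ", "hello world"])

# isnull() only flags NA values (None/NaN), so empty strings go unnoticed.
n_null = int(clean_text.isnull().sum())

# A row counts as blank if nothing remains after stripping whitespace.
is_blank = clean_text.str.strip().eq("")
n_blank = int(is_blank.sum())

# Keep only the rows that still contain some text.
non_blank = clean_text[~is_blank]

print(n_null, n_blank, list(non_blank))
```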

batmanscode commented 2 years ago

Ah, this makes sense, thank you. There are indeed empty values here. Is there a way to get tomotopy to skip these? Removing them isn't really a problem; I'm just curious.

bab2min commented 2 years ago

@batmanscode Currently, add_doc has no such feature. But I think it's a good idea to add the option to ignore empty docs.
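
Until such an option exists, the skip can be done on the caller's side. The following is a minimal sketch, not part of tomotopy: `add_nonempty_docs` is a hypothetical helper that works with any object exposing an add_doc method, such as a tomotopy model.

```python
def add_nonempty_docs(mdl, texts):
    """Add each text to mdl as a token list, skipping texts with no tokens.

    Returns the number of skipped (empty) documents.
    """
    skipped = 0
    for text in texts:
        words = text.strip().split()
        if not words:
            # Nothing left after cleaning, so do not pass it to add_doc.
            skipped += 1
            continue
        mdl.add_doc(words)
    return skipped
```

With the code from this thread, it would be called as add_nonempty_docs(mdl, df['clean_text']). Note that after skipping, document positions in the model no longer line up one-to-one with dataframe rows, so filtering the dataframe first may be preferable if that mapping matters.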

batmanscode commented 2 years ago

@bab2min Agreed. It would be a nice quality-of-life feature to have.