MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

list input returns FLOAT error in topic_model.fit_transform() #195

Closed sewokim closed 3 years ago

sewokim commented 3 years ago

Hello :) Thank you for sharing your great BERTopic package!

I'm running into an unexpected error when fitting a fairly large set of input documents, and I hope you can help me.

First, I'm using version 0.8.1:

$ pip show bertopic
Name: bertopic
Version: 0.8.1
Summary: BERTopic performs topic Modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic
Author: Maarten P. Grootendorst
Author-email: maartengrootendorst@gmail.com
License: UNKNOWN
Location: /home/seonwook/anaconda3/lib/python3.8/site-packages
Requires: scikit-learn, umap-learn, sentence-transformers, numpy, plotly, pandas, hdbscan, tqdm
Required-by:

Whenever I try to fit a list of more than about 2,100 documents,

from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

df = pd.read_excel('/home/seonwook/Work/kci/KCI0221A.xlsx', sheet_name=0,
                   usecols=['PY', 'AB'], dtype={'PY': int, 'AB': str}, nrows=2200)
docs = df['AB'].values.tolist()

print(len(docs))

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(docs)

I get a TypeError saying, as far as I can tell, that a float is being used where a str is expected. Everything works fine if I keep the input list to fewer than roughly 2,100 documents.
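In case the data itself is the culprit, here is a rough way to look for entries that are not plain Python strings (just a sketch, reusing the docs list built above):

# list any entries of docs that are not plain str (e.g. NaN read from Excel comes in as float)
bad = [(i, d) for i, d in enumerate(docs) if not isinstance(d, str)]
print(len(bad), bad[:5])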

Thank you for your help in advance!

2200


TypeError Traceback (most recent call last)

<ipython-input-...> in <module>
     15 vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
     16 topic_model = BERTopic(vectorizer_model=vectorizer_model)
---> 17 topics, _ = topic_model.fit_transform(docs)

~/anaconda3/lib/python3.8/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    287
    288         # Extract topics by calculating c-TF-IDF
--> 289         self._extract_topics(documents)
    290
    291         # Reduce topics

~/anaconda3/lib/python3.8/site-packages/bertopic/_bertopic.py in _extract_topics(self, documents)
   1357             c_tf_idf: The resulting matrix giving a value (importance score) for each word per topic
   1358         """
--> 1359        documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   1360         self.c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(documents))
   1361         self.topics = self._extract_words_per_topic(words)

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    977
    978         op = GroupByApply(self, func, args, kwargs)
--> 979         result = op.agg()
    980         if not is_dict_like(func) and result is not None:
    981             return result

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in agg(self)
    159
    160         if is_dict_like(arg):
--> 161             return self.agg_dict_like()
    162         elif is_list_like(arg):
    163             # we require a list, but not a 'str'

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in agg_dict_like(self)
    433         else:
    434             # key used for column selection and output
--> 435             results = {
    436                 key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()
    437             }

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in <dictcomp>(.0)
    434             # key used for column selection and output
    435             results = {
--> 436                 key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()
    437             }
    438

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    263
    264         try:
--> 265             return self._python_agg_general(func, *args, **kwargs)
    266         except KeyError:
    267             # TODO: KeyError is raised in _python_agg_general,

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
   1324
   1325         if not output:
--> 1326        return self._python_apply_general(f, self._selected_obj)
   1327
   1328         return self._wrap_aggregated_output(output)

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1285             data after applying f
   1286         """
--> 1287        keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1288
   1289         return self._wrap_applied_output(

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    818             # group might be modified
    819             group_axes = group.axes
--> 820             res = f(group)
    821             if not _is_indexed_like(res, group_axes, axis):
    822                 mutated = True

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in <lambda>(x)
   1294     def _python_agg_general(self, func, *args, **kwargs):
   1295         func = com.is_builtin_func(func)
--> 1296        f = lambda x: func(x, *args, **kwargs)
   1297
   1298         # iterate through "columns" ex exclusions to populate output dict

TypeError: sequence item 1: expected str instance, float found
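The last frame points at the ' '.join aggregation over the Document column, so presumably one of the entries is not a str. As a minimal illustration (not taken from the original run), the same message can be reproduced with:

import numpy as np

# str.join requires every item to be a str; NaN is a float, so it raises the same TypeError
' '.join(['some abstract text', np.nan])
# TypeError: sequence item 1: expected str instance, float found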
sewokim commented 3 years ago

Problem solved. It was due to NaN values hidden in the input list. Thank you for your great work.
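In case anyone else runs into this, dropping the NaN rows before building the document list should be enough, roughly (same columns as in the snippet above):

df = df.dropna(subset=['AB'])          # drop rows whose abstract is missing (NaN)
docs = df['AB'].astype(str).tolist()   # make sure every entry is a plain str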