Closed: sewokim closed this issue 3 years ago.
Hello :) Thank you for sharing your great BERTopic package!
I'm running into an unexpected error when fitting a somewhat large set of input documents, and I hope you can help me.
First, I'm using version 0.8.1:
```
$ pip show bertopic
Name: bertopic
Version: 0.8.1
Summary: BERTopic performs topic modeling with state-of-the-art transformer models.
Home-page: https://github.com/MaartenGr/BERTopic
Author: Maarten P. Grootendorst
Author-email: maartengrootendorst@gmail.com
License: UNKNOWN
Location: /home/seonwook/anaconda3/lib/python3.8/site-packages
Requires: scikit-learn, umap-learn, sentence-transformers, numpy, plotly, pandas, hdbscan, tqdm
Required-by:
```
And whenever I try to fit a list of more than about 2,100 documents with the code below,
```python
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np

df = pd.read_excel('/home/seonwook/Work/kci/KCI0221A.xlsx',
                   sheet_name=0, usecols=['PY', 'AB'],
                   dtype={'PY': int, 'AB': str}, nrows=2200)
docs = df['AB'].values.tolist()
print(len(docs))

vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(docs)
```
it raises a TypeError saying, I guess, that a float is being used where a str is expected (full traceback below). Everything works fine if I keep the input list under about 2,100 documents.
Thanks in advance for your help!
```
2200
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
     15 vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english", min_df=10)
     16 topic_model = BERTopic(vectorizer_model=vectorizer_model)
---> 17 topics, _ = topic_model.fit_transform(docs)

~/anaconda3/lib/python3.8/site-packages/bertopic/_bertopic.py in fit_transform(self, documents, embeddings, y)
    287
    288         # Extract topics by calculating c-TF-IDF
--> 289         self._extract_topics(documents)
    290
    291         # Reduce topics

~/anaconda3/lib/python3.8/site-packages/bertopic/_bertopic.py in _extract_topics(self, documents)
   1357             c_tf_idf: The resulting matrix giving a value (importance score) for each word per topic
   1358         """
-> 1359         documents_per_topic = documents.groupby(['Topic'], as_index=False).agg({'Document': ' '.join})
   1360         self.c_tf_idf, words = self._c_tf_idf(documents_per_topic, m=len(documents))
   1361         self.topics = self._extract_words_per_topic(words)

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    977
    978         op = GroupByApply(self, func, args, kwargs)
--> 979         result = op.agg()
    980         if not is_dict_like(func) and result is not None:
    981             return result

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in agg(self)
    159
    160         if is_dict_like(arg):
--> 161             return self.agg_dict_like()
    162         elif is_list_like(arg):
    163             # we require a list, but not a 'str'

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in agg_dict_like(self)
    433         else:
    434             # key used for column selection and output
--> 435             results = {
    436                 key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()
    437             }

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in <dictcomp>(.0)
    434             # key used for column selection and output
    435             results = {
--> 436                 key: obj._gotitem(key, ndim=1).agg(how) for key, how in arg.items()
    437             }
    438

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    263
    264         try:
--> 265             return self._python_agg_general(func, *args, **kwargs)
    266         except KeyError:
    267             # TODO: KeyError is raised in _python_agg_general,

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
   1324
   1325         if not output:
-> 1326             return self._python_apply_general(f, self._selected_obj)
   1327
   1328         return self._wrap_aggregated_output(output)

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1285             data after applying f
   1286         """
-> 1287         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1288
   1289         return self._wrap_applied_output(

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    818                 # group might be modified
    819                 group_axes = group.axes
--> 820                 res = f(group)
    821                 if not _is_indexed_like(res, group_axes, axis):
    822                     mutated = True

~/anaconda3/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in <lambda>(x)
   1294     def _python_agg_general(self, func, *args, **kwargs):
   1295         func = com.is_builtin_func(func)
-> 1296         f = lambda x: func(x, *args, **kwargs)
   1297
   1298         # iterate through "columns" ex exclusions to populate output dict

TypeError: sequence item 1: expected str instance, float found
```
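If I read the last frames correctly, the `agg({'Document': ' '.join})` aggregation fails as soon as one of the grouped documents is not a string. A minimal sketch that reproduces the same TypeError (the list here is hypothetical, not my real data):

```python
# Hypothetical two-entry document list: one str, one float (e.g. NaN).
docs = ["a valid abstract", float("nan")]

# str.join requires every item to be a str, so this raises:
# TypeError: sequence item 1: expected str instance, float found
" ".join(docs)
```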
Problem solved. It was due to NaN values hidden in the input list. Thank you for your great work!
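For anyone hitting the same error, a minimal sketch of the cleanup, assuming (as in my case) the NaNs come from empty cells in the `AB` column:

```python
import pandas as pd

df = pd.read_excel('/home/seonwook/Work/kci/KCI0221A.xlsx',
                   sheet_name=0, usecols=['PY', 'AB'],
                   dtype={'PY': int, 'AB': str}, nrows=2200)

# Empty abstract cells are read back as NaN (a float), which later breaks
# the ' '.join aggregation inside BERTopic. Drop those rows before fitting.
df = df.dropna(subset=['AB'])
docs = df['AB'].tolist()
```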