drob-xx / TopicTuner

HDBSCAN Tuning for BERTopic Models
GNU General Public License v3.0

hdbscan gives an array problem #13

Closed chrislalk closed 1 year ago

chrislalk commented 1 year ago

Hi,

Suddenly, whenever I use randomSearch, GridSearch, or anything else related to runHDBSCAN, I get the following error:

ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

It seems like there is a problem with the parameters for runHDBSCAN. Do you have an idea how to deal with that?

Thanks!!!

drob-xx commented 1 year ago

It is likely an issue with how the values are being constructed. It could be a bug. I can't help without seeing the code. Please provide the relevant code and I'll take a look.

chrislalk commented 1 year ago

Sorry for the brevity. Here is the code:

from topictuner import TopicModelTuner as TMT
from hdbscan import HDBSCAN
...

tmt = TMT()
tmt.embeddings = embeddings
tmt.docs = df_pat["Patient"].tolist()
tmt.reduce()

lastRunResultsDF = tmt.randomSearch([*range(30,90)], [.1, .2, .5, .75, 1])
fig = tmt.visualizeSearch(lastRunResultsDF)
fig.show(renderer="browser")

tmt.randomSearch(...) gives the problem. Also, the last reference is:

...\Python\Python310\lib\site-packages\sklearn\utils\validation.py", line 871, in check_array
    raise ValueError(
ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Thanks a lot!

drob-xx commented 1 year ago

Thanks. It's not clear yet: there is nothing wrong with the format of the parameters to randomSearch, which is what I assume is being complained about. A couple of things:

drob-xx commented 1 year ago

Closing this due to lack of activity. Please re-open as necessary.

echoht commented 3 months ago

Hi, I have the same problem!

My code is:

tmt = TMT.wrapBERTopicModel(topic_model)
tmt.randomSearch([*range(30, 151)], [.1, .25, .5, .75, 1])

pip show topicmodeltuner
Name: topicmodeltuner
Version: 0.3.4
Summary: HDBSCAN Tuning for BERTopic Models
Home-page: https://github.com/drob-xx/TopicTuner
Author: Dan Robinson
Author-email: drob707@gmail.com
License:
Location: /opt/conda/envs/llama_py38/lib/python3.8/site-packages
Requires: bertopic, loguru
Required-by:

python --version
Python 3.8.17

The entire exception is:

File /opt/conda/envs/llama_py38/lib/python3.8/site-packages/topictuner/basetuner.py:378, in BaseHDBSCANTuner._runTests(self, searchParams)
    373     results = [
    374         (params.cs, params.ss, self.runHDBSCAN(params.cs, params.ss))
    375         for params in tqdm(searchParams)
    376     ]
    377 else:
--> 378     results = [
    379         (params.cs, params.ss, self.runHDBSCAN(params.cs, params.ss))
    380         for params in searchParams
    381     ]
    382 RunResultsDF = pd.DataFrame()
    383 RunResultsDF["min_cluster_size"] = [tupe[0] for tupe in results]

File /opt/conda/envs/llama_py38/lib/python3.8/site-packages/topictuner/basetuner.py:379, in <listcomp>(.0)
    373     results = [
    374         (params.cs, params.ss, self.runHDBSCAN(params.cs, params.ss))
    375         for params in tqdm(searchParams)
    376     ]
    377 else:
    378     results = [
--> 379         (params.cs, params.ss, self.runHDBSCAN(params.cs, params.ss))
    380         for params in searchParams
    381     ]
    382 RunResultsDF = pd.DataFrame()
    383 RunResultsDF["min_cluster_size"] = [tupe[0] for tupe in results]

File /opt/conda/envs/llama_py38/lib/python3.8/site-packages/topictuner/basetuner.py:98, in BaseHDBSCANTuner.runHDBSCAN(self, min_cluster_size, min_samples)
     94 min_cluster_size, min_samples = self._check_CS_SS(
     95     min_cluster_size, min_samples, True
     96 )
     97 hdbscan_model = self.getHDBSCAN(min_cluster_size, min_samples)
---> 98 hdbscan_model.fit_predict(self.target_vectors)
     99 return hdbscan_model.labels_

File /opt/conda/envs/llama_py38/lib/python3.8/site-packages/hdbscan/hdbscan_.py:1243, in HDBSCAN.fit_predict(self, X, y)
   1228 def fit_predict(self, X, y=None):
   1229     """Performs clustering on X and returns cluster labels.
   1230
   1231     Parameters
   (...)
   1241         cluster labels
   1242     """
-> 1243     self.fit(X)
   1244     return self.labels_

File /opt/conda/envs/llama_py38/lib/python3.8/site-packages/hdbscan/hdbscan_.py:1167, in HDBSCAN.fit(self, X, y)
   1150 """Perform HDBSCAN clustering from features or distance matrix.
   1151
   1152 Parameters
   (...)
   1162     Returns self
   1163 """
   1164 if self.metric != "precomputed":
   1165     # Non-precomputed matrices may contain non-finite values.
   1166     # Rows with these values
-> 1167     X = check_array(X, accept_sparse="csr", force_all_finite=False)
   1168     self._raw_data = X
   1170     self._all_finite = is_finite(X)

File /opt/conda/envs/llama_py38/lib/python3.8/site-packages/sklearn/utils/validation.py:932, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    929 if ensure_2d:
    930     # If input is scalar raise error
    931     if array.ndim == 0:
--> 932         raise ValueError(
    933             "Expected 2D array, got scalar array instead:\narray={}.\n"
    934             "Reshape your data either using array.reshape(-1, 1) if "
    935             "your data has a single feature or array.reshape(1, -1) "
    936             "if it contains a single sample.".format(array)
    937         )
    938     # If input is 1D raise error
    939     if array.ndim == 1:

ValueError: Expected 2D array, got scalar array instead:
array=None.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
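
Reading the traceback: the `array=None` in the message is the value of `self.target_vectors` that runHDBSCAN passes to `fit_predict` at basetuner.py:98, so the tuner had no vectors to cluster when the search ran. Why it was None is not established in this thread. A minimal sketch of a pre-check that would surface this earlier, assuming only the `target_vectors` attribute name seen in the traceback; `FakeTuner` and `require_target_vectors` are hypothetical names used for illustration and are not part of TopicTuner:

```python
class FakeTuner:
    """Hypothetical stand-in for a tuner; it mimics only the one
    attribute that appears in the traceback (target_vectors)."""

    def __init__(self, target_vectors=None):
        self.target_vectors = target_vectors


def require_target_vectors(tuner):
    """Return tuner.target_vectors, or raise a descriptive error if it is None.

    Hypothetical helper: it only inspects the attribute and calls no
    TopicTuner, hdbscan, or scikit-learn code.
    """
    vectors = getattr(tuner, "target_vectors", None)
    if vectors is None:
        raise RuntimeError(
            "target_vectors is None: there is nothing for HDBSCAN to cluster. "
            "Check that the embedding and reduction steps ran and stored a result."
        )
    return vectors


# A tuner without vectors fails early with a readable message.
try:
    require_target_vectors(FakeTuner())
except RuntimeError as err:
    message = str(err)

# A tuner that holds vectors passes the check unchanged.
vectors = require_target_vectors(FakeTuner([[0.1, 0.2], [0.3, 0.4]]))
```

With a real tuner object, the same check could be called as `require_target_vectors(tmt)` just before `tmt.randomSearch(...)`, since it relies on nothing but the attribute lookup.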