MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k. #1512

Open ElskeNijhof opened 1 year ago

ElskeNijhof commented 1 year ago

Hi!

When I try to run bertopic() I get the following error:

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

I increased the number of documents to 205504, which should be enough I think.

Does someone have any idea what could cause the problem?

ananaphasia commented 1 year ago

Could you please show your code and stack trace?

ElskeNijhof commented 1 year ago

Yes! This is the full error. I need to wait for my boss's approval to share the code :)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\bertopic\_bertopic.py:2827, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
   2826 try:
-> 2827     self.umap_model.fit(embeddings, y=y)
   2828 except TypeError:

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2684, in UMAP.fit(self, X, y)
   2683 if self.transform_mode == "embedding":
-> 2684     self.embedding_, aux_data = self._fit_embed_data(
   2685         self._raw_data[index],
   2686         self.n_epochs,
   2687         init,
   2688         random_state,  # JH why raw data?
   2689     )
   2690     # Assign any points that are fully disconnected from our manifold(s) to have embedding
   2691     # coordinates of np.nan. These will be filtered by our plotting functions automatically.
   2692     # They also prevent users from being deceived a distance query to one of these points.
   2693     # Might be worth moving this into simplicial_set_embedding or _fit_embed_data

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2717, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
   2714 """A method wrapper for simplicial_set_embedding that can be
   2715 replaced by subclasses.
   2716 """
-> 2717 return simplicial_set_embedding(
   2718     X,
   2719     self.graph_,
   2720     self.n_components,
   2721     self._initial_alpha,
   2722     self._a,
   2723     self._b,
   2724     self.repulsion_strength,
   2725     self.negative_sample_rate,
   2726     n_epochs,
   2727     init,
   2728     random_state,
   2729     self._input_distance_func,
   2730     self._metric_kwds,
   2731     self.densmap,
   2732     self._densmap_kwds,
   2733     self.output_dens,
   2734     self._output_distance_func,
   2735     self._output_metric_kwds,
   2736     self.output_metric in ("euclidean", "l2"),
   2737     self.random_state is None,
   2738     self.verbose,
   2739     tqdm_kwds=self.tqdm_kwds,
   2740 )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:1078, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
   1076 elif isinstance(init, str) and init == "spectral":
   1077     # We add a little noise to avoid local minima for optimization to come
-> 1078     initialisation = spectral_layout(
   1079         data,
   1080         graph,
   1081         n_components,
   1082         random_state,
   1083         metric=metric,
   1084         metric_kwds=metric_kwds,
   1085     )
   1086     expansion = 10.0 / np.abs(initialisation).max()

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\spectral.py:332, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    331 if L.shape[0] < 2000000:
--> 332     eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    333         L,
    334         k,
    335         which="SM",
    336         ncv=num_lanczos_vectors,
    337         tol=1e-4,
    338         v0=np.ones(L.shape[0]),
    339         maxiter=graph.shape[0] * 5,
    340     )
    341 else:

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1604 if issparse(A):
-> 1605     raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
   1606                     "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
   1607                     " reduce k.")
   1608 if isinstance(A, LinearOperator):

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[44], line 1
----> 1 topics_1, probabilities_per_dn_1, topic_model_1 = apply_bertopic(df_data_V2)

Cell In[15], line 38, in apply_bertopic(data)
     35 data = [str(point) for point in data]
     37 # generate topics with probabilities per DN
---> 38 topics, probabilities_per_dn = topic_model.fit_transform(data)
     40 return topics, probabilities_per_dn, topic_model

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\bertopic\_bertopic.py:351, in BERTopic.fit_transform(self, documents, embeddings, y)
    349 if self.seed_topic_list is not None and self.embedding_model is not None:
    350     y, embeddings = self._guided_topic_modeling(embeddings)
--> 351 umap_embeddings = self._reduce_dimensionality(embeddings, y)
    353 # Cluster reduced embeddings
    354 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\bertopic\_bertopic.py:2831, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
   2828 except TypeError:
   2829     logger.info("The dimensionality reduction algorithm did not contain the y parameter and"
   2830                 " therefore the y parameter was not used")
-> 2831     self.umap_model.fit(embeddings)
   2833 umap_embeddings = self.umap_model.transform(embeddings)
   2834 logger.info("Reduced dimensionality")

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2684, in UMAP.fit(self, X, y)
   2681     print(ts(), "Construct embedding")
   2683 if self.transform_mode == "embedding":
-> 2684     self.embedding_, aux_data = self._fit_embed_data(
   2685         self._raw_data[index],
   2686         self.n_epochs,
   2687         init,
   2688         random_state,  # JH why raw data?
   2689     )
   2690     # Assign any points that are fully disconnected from our manifold(s) to have embedding
   2691     # coordinates of np.nan. These will be filtered by our plotting functions automatically.
   2692     # They also prevent users from being deceived a distance query to one of these points.
   2693     # Might be worth moving this into simplicial_set_embedding or _fit_embed_data
   2694     disconnected_vertices = np.array(self.graph_.sum(axis=1)).flatten() == 0

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2717, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
   2713 def _fit_embed_data(self, X, n_epochs, init, random_state):
   2714     """A method wrapper for simplicial_set_embedding that can be
   2715     replaced by subclasses.
   2716     """
-> 2717     return simplicial_set_embedding(
   2718         X,
   2719         self.graph_,
   2720         self.n_components,
   2721         self._initial_alpha,
   2722         self._a,
   2723         self._b,
   2724         self.repulsion_strength,
   2725         self.negative_sample_rate,
   2726         n_epochs,
   2727         init,
   2728         random_state,
   2729         self._input_distance_func,
   2730         self._metric_kwds,
   2731         self.densmap,
   2732         self._densmap_kwds,
   2733         self.output_dens,
   2734         self._output_distance_func,
   2735         self._output_metric_kwds,
   2736         self.output_metric in ("euclidean", "l2"),
   2737         self.random_state is None,
   2738         self.verbose,
   2739         tqdm_kwds=self.tqdm_kwds,
   2740     )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:1078, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
   1073     embedding = random_state.uniform(
   1074         low=-10.0, high=10.0, size=(graph.shape[0], n_components)
   1075     ).astype(np.float32)
   1076 elif isinstance(init, str) and init == "spectral":
   1077     # We add a little noise to avoid local minima for optimization to come
-> 1078     initialisation = spectral_layout(
   1079         data,
   1080         graph,
   1081         n_components,
   1082         random_state,
   1083         metric=metric,
   1084         metric_kwds=metric_kwds,
   1085     )
   1086     expansion = 10.0 / np.abs(initialisation).max()
   1087     embedding = (initialisation * expansion).astype(
   1088         np.float32
   1089     ) + random_state.normal(
   (...)
   1092         np.float32
   1093     )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\spectral.py:332, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    330 try:
    331     if L.shape[0] < 2000000:
--> 332         eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    333             L,
    334             k,
    335             which="SM",
    336             ncv=num_lanczos_vectors,
    337             tol=1e-4,
    338             v0=np.ones(L.shape[0]),
    339             maxiter=graph.shape[0] * 5,
    340         )
    341     else:
    342         eigenvalues, eigenvectors = scipy.sparse.linalg.lobpcg(
    343             L, random_state.normal(size=(L.shape[0], k)), largest=False, tol=1e-8
    344         )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1600 warnings.warn("k >= N for N * N square matrix. "
   1601               "Attempting to use scipy.linalg.eigh instead.",
   1602               RuntimeWarning)
   1604 if issparse(A):
-> 1605     raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
   1606                     "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
   1607                     " reduce k.")
   1608 if isinstance(A, LinearOperator):
   1609     raise TypeError("Cannot use scipy.linalg.eigh for LinearOperator "
   1610                     "A with k >= N.")

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

ndettmer commented 10 months ago

Hello, I'm getting the same error just from running topic_model.visualize_topics() after executing the first block of the quickstart code on my CPU. On my GPU (on another machine) it worked just fine.

Pip Package Versions:

umap-learn==0.5.4
bertopic==0.15.0
scipy==1.11.3

OS: Ubuntu 22.04.1
CPU: 12th Gen Intel(R) Core(TM) i7-1260P

Stack Trace:

----> 1 topic_model.visualize_topics()

File ***/venv/lib/python3.10/site-packages/bertopic/_bertopic.py:2193, in BERTopic.visualize_topics(self, topics, top_n_topics, custom_labels, title, width, height)
   2163 """ Visualize topics, their sizes, and their corresponding words
   2164 
   2165 This visualization is highly inspired by LDAvis, a great visualization
   (...)
   2190 ```
   2191 """
   2192 check_is_fitted(self)
-> 2193 return plotting.visualize_topics(self,
   2194                                  topics=topics,
   2195                                  top_n_topics=top_n_topics,
   2196                                  custom_labels=custom_labels,
   2197                                  title=title,
   2198                                  width=width,
   2199                                  height=height)

File ***/venv/lib/python3.10/site-packages/bertopic/plotting/_topics.py:79, in visualize_topics(topic_model, topics, top_n_topics, custom_labels, title, width, height)
     77 if topic_model.topic_embeddings_ is not None:
     78     embeddings = topic_model.topic_embeddings_[indices]
---> 79     embeddings = UMAP(n_neighbors=2, n_components=2, metric='cosine', random_state=42).fit_transform(embeddings)
     80 else:
     81     embeddings = topic_model.c_tf_idf_.toarray()[indices]

File ***/venv/lib/python3.10/site-packages/umap/umap_.py:2887, in UMAP.fit_transform(self, X, y, force_all_finite)
   2851 def fit_transform(self, X, y=None, force_all_finite=True):
   2852     """Fit X into an embedded space and return that transformed
   2853     output.
   2854 
   (...)
   2885         Local radii of data points in the embedding (log-transformed).
   2886     """
-> 2887     self.fit(X, y, force_all_finite)
   2888     if self.transform_mode == "embedding":
   2889         if self.output_dens:

File ***/venv/lib/python3.10/site-packages/umap/umap_.py:2780, in UMAP.fit(self, X, y, force_all_finite)
   2776 if self.transform_mode == "embedding":
   2777     epochs = (
   2778         self.n_epochs_list if self.n_epochs_list is not None else self.n_epochs
   2779     )
-> 2780     self.embedding_, aux_data = self._fit_embed_data(
   2781         self._raw_data[index],
   2782         epochs,
   2783         init,
   2784         random_state,  # JH why raw data?
   2785     )
   2787     if self.n_epochs_list is not None:
   2788         if "embedding_list" not in aux_data:

File ***/venv/lib/python3.10/site-packages/umap/umap_.py:2826, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
   2822 def _fit_embed_data(self, X, n_epochs, init, random_state):
   2823     """A method wrapper for simplicial_set_embedding that can be
   2824     replaced by subclasses.
   2825     """
-> 2826     return simplicial_set_embedding(
   2827         X,
   2828         self.graph_,
   2829         self.n_components,
   2830         self._initial_alpha,
   2831         self._a,
   2832         self._b,
   2833         self.repulsion_strength,
   2834         self.negative_sample_rate,
   2835         n_epochs,
   2836         init,
   2837         random_state,
   2838         self._input_distance_func,
   2839         self._metric_kwds,
   2840         self.densmap,
   2841         self._densmap_kwds,
   2842         self.output_dens,
   2843         self._output_distance_func,
   2844         self._output_metric_kwds,
   2845         self.output_metric in ("euclidean", "l2"),
   2846         self.random_state is None,
   2847         self.verbose,
   2848         tqdm_kwds=self.tqdm_kwds,
   2849     )

File ***/venv/lib/python3.10/site-packages/umap/umap_.py:1106, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
   1102     embedding = noisy_scale_coords(
   1103         embedding, random_state, max_coord=10, noise=0.0001
   1104     )
   1105 elif isinstance(init, str) and init == "spectral":
-> 1106     embedding = spectral_layout(
   1107         data,
   1108         graph,
   1109         n_components,
   1110         random_state,
   1111         metric=metric,
   1112         metric_kwds=metric_kwds,
   1113     )
   1114     # We add a little noise to avoid local minima for optimization to come
   1115     embedding = noisy_scale_coords(
   1116         embedding, random_state, max_coord=10, noise=0.0001
   1117     )

File ***/venv/lib/python3.10/site-packages/umap/spectral.py:304, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds, tol, maxiter)
    263 def spectral_layout(
    264     data,
    265     graph,
   (...)
    271     maxiter=0
    272 ):
    273     """
    274     Given a graph compute the spectral embedding of the graph. This is
    275     simply the eigenvectors of the laplacian of the graph. Here we use the
   (...)
    302         The spectral embedding of the graph.
    303     """
--> 304     return _spectral_layout(
    305         data=data,
    306         graph=graph,
    307         dim=dim,
    308         random_state=random_state,
    309         metric=metric,
    310         metric_kwds=metric_kwds,
    311         init="random",
    312         tol=tol,
    313         maxiter=maxiter
    314     )

File ***/venv/lib/python3.10/site-packages/umap/spectral.py:521, in _spectral_layout(data, graph, dim, random_state, metric, metric_kwds, init, method, tol, maxiter)
    518 X[:, 0] = sqrt_deg / np.linalg.norm(sqrt_deg)
    520 if method == "eigsh":
--> 521     eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
    522         L,
    523         k,
    524         which="SM",
    525         ncv=num_lanczos_vectors,
    526         tol=tol or 1e-4,
    527         v0=np.ones(L.shape[0]),
    528         maxiter=maxiter or graph.shape[0] * 5,
    529     )
    530 elif method == "lobpcg":
    531     with warnings.catch_warnings():

File ***/venv/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1600 warnings.warn("k >= N for N * N square matrix. "
   1601               "Attempting to use scipy.linalg.eigh instead.",
   1602               RuntimeWarning)
   1604 if issparse(A):
-> 1605     raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
   1606                     "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
   1607                     " reduce k.")
   1608 if isinstance(A, LinearOperator):
   1609     raise TypeError("Cannot use scipy.linalg.eigh for LinearOperator "
   1610                     "A with k >= N.")

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

MaartenGr commented 10 months ago

@ndettmer This might relate to the number of topics that you have. If there are only a few, for example fewer than 10, UMAP might throw an error because it has trouble reducing dimensionality on such a small dataset.
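The scipy check at the bottom of the stack traces above can be reproduced in isolation. This is only a minimal sketch: the diagonal matrix is a hypothetical stand-in for the graph Laplacian that UMAP builds, and `eigsh_type_error` is an illustrative helper, not part of BERTopic, UMAP, or scipy. It shows that `scipy.sparse.linalg.eigsh` raises this `TypeError` whenever the number of requested eigenvectors `k` is at least the matrix size `N` and the input is sparse.

```python
import warnings
from typing import Optional

import numpy as np
import scipy.sparse
import scipy.sparse.linalg


def eigsh_type_error(n: int, k: int) -> Optional[str]:
    """Return the TypeError message eigsh raises for an n x n sparse matrix, or None."""
    # Hypothetical stand-in for a graph Laplacian: a sparse diagonal matrix.
    matrix = scipy.sparse.diags(np.arange(1.0, n + 1.0), format="csr")
    with warnings.catch_warnings():
        # eigsh first emits a RuntimeWarning about k >= N, then raises for sparse input.
        warnings.simplefilter("ignore", RuntimeWarning)
        try:
            scipy.sparse.linalg.eigsh(matrix, k=k)
        except TypeError as exc:
            return str(exc)
    return None


print(eigsh_type_error(n=3, k=3))   # k >= N: the message reported in this issue
print(eigsh_type_error(n=10, k=3))  # k < N: no TypeError is raised
```

With only a handful of points, the matrix passed to `eigsh` can be as small as the requested `k`, which would be consistent with the error appearing only when few topics are found.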

ndettmer commented 10 months ago

@MaartenGr thank you! I actually just took a subset of 20newsgroups. Therefore, the resulting number of topics was quite low. With a larger subset it worked.

gnanukoth commented 4 months ago

@MaartenGr So in cases where the number of topics is small, how would you suggest calculating the x, y representation of the topics for visualisation purposes?

MaartenGr commented 4 months ago

@gnanukoth I believe lowering the n_neighbors parameter when using UMAP should solve the issue. I remember another issue discussing this with a potential solution, but you would have to search the issues page for .visualize_topics.

gnanukoth commented 4 months ago

Noted @MaartenGr, I will search through the issues. Thanks!

gnanukoth commented 4 months ago

Leaving the possible solution found in another issue here: link, for easy reference. Thanks!

gnanukoth commented 4 months ago

Hello @MaartenGr, the solution proposed in the other issue also works only when the number of topics is >= 4. For cases where the number of topics is < 4, I tried PCA instead, and it seems to work with the plotly visualization. Do you think using PCA is meaningful in such situations? Is there anything else I should consider when using PCA for this visualization?
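For reference, a PCA projection to two dimensions can be sketched with plain NumPy. This is an illustration under assumptions, not the code used above: `pca_2d` is a hypothetical helper, and the three embeddings are made-up values standing in for topic embeddings.

```python
import numpy as np


def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project the rows of `embeddings` onto their first two principal components."""
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2 or x.shape[0] < 2 or x.shape[1] < 2:
        raise ValueError("need at least 2 points with at least 2 features each")
    # Center the data so the principal directions pass through the mean.
    centered = x - x.mean(axis=0)
    # SVD of the centered data: u * s gives the coordinates along each component.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]


# Three made-up "topic embeddings" with five dimensions each.
topic_embeddings = np.array([
    [1.0, 0.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 1.0, 0.0, 2.0],
])
coords = pca_2d(topic_embeddings)
print(coords.shape)  # (3, 2)
```

Unlike UMAP with spectral initialization, this has no minimum of four points. With three points the projection is exact, because three centered points always lie in a plane, so the plot shows their true relative distances.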

MaartenGr commented 4 months ago

@gnanukoth In all honesty, these dimensionality reduction algorithms generally work better the more data is available, since they are trained on it. So if you have 4 or fewer data points, consider whether it is actually meaningful to perform dimensionality reduction at all. I wonder how well such an algorithm can perform with such a small dataset.
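One way to act on this is to count the topics before asking for a 2D layout. The sketch below is hypothetical: `count_topics` and `enough_topics_for_umap` are illustrative names, the `-1` outlier label is assumed to follow BERTopic's convention, and the default threshold of 4 is taken only from the report earlier in this thread that the workaround needs at least 4 topics.

```python
from typing import Iterable


def count_topics(topics: Iterable[int], outlier_label: int = -1) -> int:
    """Count distinct topic ids, ignoring the (assumed) outlier label."""
    return len({t for t in topics if t != outlier_label})


def enough_topics_for_umap(topics: Iterable[int], min_topics: int = 4) -> bool:
    """Return True if there are at least `min_topics` distinct non-outlier topics."""
    # min_topics=4 is an assumption based on the behaviour reported in this thread.
    return count_topics(topics) >= min_topics


# Per-document topic assignments, shaped like the first value returned by fit_transform.
doc_topics = [0, 1, -1, 0, 2, 1]
print(count_topics(doc_topics))            # 3
print(enough_topics_for_umap(doc_topics))  # False
```

When the check fails, listing the topics and their top words may be more informative than a scatter plot of three or four points.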