Open ElskeNijhof opened 1 year ago
Could you please show your code and stack trace?
Yes! This is the full error. I still need to wait for my boss's approval before I can share the code :)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\bertopic\_bertopic.py:2827, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
2826 try:
-> 2827 self.umap_model.fit(embeddings, y=y)
2828 except TypeError:

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2684, in UMAP.fit(self, X, y)
2683 if self.transform_mode == "embedding":
-> 2684 self.embedding_, aux_data = self._fit_embed_data(
2685 self._raw_data[index],
2686 self.n_epochs,
2687 init,
2688 random_state, # JH why raw data?
2689 )
2690 # Assign any points that are fully disconnected from our manifold(s) to have embedding
2691 # coordinates of np.nan. These will be filtered by our plotting functions automatically.
2692 # They also prevent users from being deceived a distance query to one of these points.
2693 # Might be worth moving this into simplicial_set_embedding or _fit_embed_data

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2717, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
2714 """A method wrapper for simplicial_set_embedding that can be
2715 replaced by subclasses.
2716 """
-> 2717 return simplicial_set_embedding(
2718 X,
2719 self.graph_,
2720 self.n_components,
2721 self._initial_alpha,
2722 self._a,
2723 self._b,
2724 self.repulsion_strength,
2725 self.negative_sample_rate,
2726 n_epochs,
2727 init,
2728 random_state,
2729 self._input_distance_func,
2730 self._metric_kwds,
2731 self.densmap,
2732 self._densmap_kwds,
2733 self.output_dens,
2734 self._output_distance_func,
2735 self._output_metric_kwds,
2736 self.output_metric in ("euclidean", "l2"),
2737 self.random_state is None,
2738 self.verbose,
2739 tqdm_kwds=self.tqdm_kwds,
2740 )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:1078, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
1076 elif isinstance(init, str) and init == "spectral":
1077 # We add a little noise to avoid local minima for optimization to come
-> 1078 initialisation = spectral_layout(
1079 data,
1080 graph,
1081 n_components,
1082 random_state,
1083 metric=metric,
1084 metric_kwds=metric_kwds,
1085 )
1086 expansion = 10.0 / np.abs(initialisation).max()

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\spectral.py:332, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
331 if L.shape[0] < 2000000:
--> 332 eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
333 L,
334 k,
335 which="SM",
336 ncv=num_lanczos_vectors,
337 tol=1e-4,
338 v0=np.ones(L.shape[0]),
339 maxiter=graph.shape[0] * 5,
340 )
341 else:

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
1604 if issparse(A):
-> 1605 raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
1606 "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
1607 " reduce k.")
1608 if isinstance(A, LinearOperator):

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
Cell In[44], line 1
----> 1 topics_1, probabilities_per_dn_1, topic_model_1 = apply_bertopic(df_data_V2)

Cell In[15], line 38, in apply_bertopic(data)
35 data = [str(point) for point in data]
37 # generate topics with probabilities per DN
---> 38 topics, probabilities_per_dn = topic_model.fit_transform(data)
40 return topics, probabilities_per_dn, topic_model

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\bertopic\_bertopic.py:351, in BERTopic.fit_transform(self, documents, embeddings, y)
349 if self.seed_topic_list is not None and self.embedding_model is not None:
350 y, embeddings = self._guided_topic_modeling(embeddings)
--> 351 umap_embeddings = self._reduce_dimensionality(embeddings, y)
353 # Cluster reduced embeddings
354 documents, probabilities = self._cluster_embeddings(umap_embeddings, documents, y=y)

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\bertopic\_bertopic.py:2831, in BERTopic._reduce_dimensionality(self, embeddings, y, partial_fit)
2828 except TypeError:
2829 logger.info("The dimensionality reduction algorithm did not contain the `y` parameter and"
2830 " therefore the `y` parameter was not used")
-> 2831 self.umap_model.fit(embeddings)
2833 umap_embeddings = self.umap_model.transform(embeddings)
2834 logger.info("Reduced dimensionality")

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2684, in UMAP.fit(self, X, y)
2681 print(ts(), "Construct embedding")
2683 if self.transform_mode == "embedding":
-> 2684 self.embedding_, aux_data = self._fit_embed_data(
2685 self._raw_data[index],
2686 self.n_epochs,
2687 init,
2688 random_state, # JH why raw data?
2689 )
2690 # Assign any points that are fully disconnected from our manifold(s) to have embedding
2691 # coordinates of np.nan. These will be filtered by our plotting functions automatically.
2692 # They also prevent users from being deceived a distance query to one of these points.
2693 # Might be worth moving this into simplicial_set_embedding or _fit_embed_data
2694 disconnected_vertices = np.array(self.graph_.sum(axis=1)).flatten() == 0

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:2717, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
2713 def _fit_embed_data(self, X, n_epochs, init, random_state):
2714 """A method wrapper for simplicial_set_embedding that can be
2715 replaced by subclasses.
2716 """
-> 2717 return simplicial_set_embedding(
2718 X,
2719 self.graph_,
2720 self.n_components,
2721 self._initial_alpha,
2722 self._a,
2723 self._b,
2724 self.repulsion_strength,
2725 self.negative_sample_rate,
2726 n_epochs,
2727 init,
2728 random_state,
2729 self._input_distance_func,
2730 self._metric_kwds,
2731 self.densmap,
2732 self._densmap_kwds,
2733 self.output_dens,
2734 self._output_distance_func,
2735 self._output_metric_kwds,
2736 self.output_metric in ("euclidean", "l2"),
2737 self.random_state is None,
2738 self.verbose,
2739 tqdm_kwds=self.tqdm_kwds,
2740 )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\umap_.py:1078, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
1073 embedding = random_state.uniform(
1074 low=-10.0, high=10.0, size=(graph.shape[0], n_components)
1075 ).astype(np.float32)
1076 elif isinstance(init, str) and init == "spectral":
1077 # We add a little noise to avoid local minima for optimization to come
-> 1078 initialisation = spectral_layout(
1079 data,
1080 graph,
1081 n_components,
1082 random_state,
1083 metric=metric,
1084 metric_kwds=metric_kwds,
1085 )
1086 expansion = 10.0 / np.abs(initialisation).max()
1087 embedding = (initialisation * expansion).astype(
1088 np.float32
1089 ) + random_state.normal(
(...)
1092 np.float32
1093 )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\umap\spectral.py:332, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
330 try:
331 if L.shape[0] < 2000000:
--> 332 eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
333 L,
334 k,
335 which="SM",
336 ncv=num_lanczos_vectors,
337 tol=1e-4,
338 v0=np.ones(L.shape[0]),
339 maxiter=graph.shape[0] * 5,
340 )
341 else:
342 eigenvalues, eigenvectors = scipy.sparse.linalg.lobpcg(
343 L, random_state.normal(size=(L.shape[0], k)), largest=False, tol=1e-8
344 )

File c:\Users\elnijhof\AppData\Local\anaconda3\envs\py39bertopic\lib\site-packages\scipy\sparse\linalg\_eigen\arpack\arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
1600 warnings.warn("k >= N for N * N square matrix. "
1601 "Attempting to use scipy.linalg.eigh instead.",
1602 RuntimeWarning)
1604 if issparse(A):
-> 1605 raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
1606 "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
1607 " reduce k.")
1608 if isinstance(A, LinearOperator):
1609 raise TypeError("Cannot use scipy.linalg.eigh for LinearOperator "
1610 "A with k >= N.")

TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
Hello, I'm getting the same error just from running `topic_model.visualize_topics()` after executing the first block of the quickstart code on my CPU. On my GPU (on another machine) it worked just fine.
Pip Package Versions:
umap-learn==0.5.4
bertopic==0.15.0
scipy==1.11.3
OS: Ubuntu 22.04.1 CPU: 12th Gen Intel(R) Core(TM) i7-1260P
Stack Trace:
----> 1 topic_model.visualize_topics()
File ***/venv/lib/python3.10/site-packages/bertopic/_bertopic.py:2193, in BERTopic.visualize_topics(self, topics, top_n_topics, custom_labels, title, width, height)
2163 """ Visualize topics, their sizes, and their corresponding words
2164
2165 This visualization is highly inspired by LDAvis, a great visualization
(...)
2190 ```
2191 """
2192 check_is_fitted(self)
-> 2193 return plotting.visualize_topics(self,
2194 topics=topics,
2195 top_n_topics=top_n_topics,
2196 custom_labels=custom_labels,
2197 title=title,
2198 width=width,
2199 height=height)
File ***/venv/lib/python3.10/site-packages/bertopic/plotting/_topics.py:79, in visualize_topics(topic_model, topics, top_n_topics, custom_labels, title, width, height)
77 if topic_model.topic_embeddings_ is not None:
78 embeddings = topic_model.topic_embeddings_[indices]
---> 79 embeddings = UMAP(n_neighbors=2, n_components=2, metric='cosine', random_state=42).fit_transform(embeddings)
80 else:
81 embeddings = topic_model.c_tf_idf_.toarray()[indices]
File ***/venv/lib/python3.10/site-packages/umap/umap_.py:2887, in UMAP.fit_transform(self, X, y, force_all_finite)
2851 def fit_transform(self, X, y=None, force_all_finite=True):
2852 """Fit X into an embedded space and return that transformed
2853 output.
2854
(...)
2885 Local radii of data points in the embedding (log-transformed).
2886 """
-> 2887 self.fit(X, y, force_all_finite)
2888 if self.transform_mode == "embedding":
2889 if self.output_dens:
File ***/venv/lib/python3.10/site-packages/umap/umap_.py:2780, in UMAP.fit(self, X, y, force_all_finite)
2776 if self.transform_mode == "embedding":
2777 epochs = (
2778 self.n_epochs_list if self.n_epochs_list is not None else self.n_epochs
2779 )
-> 2780 self.embedding_, aux_data = self._fit_embed_data(
2781 self._raw_data[index],
2782 epochs,
2783 init,
2784 random_state, # JH why raw data?
2785 )
2787 if self.n_epochs_list is not None:
2788 if "embedding_list" not in aux_data:
File ***/venv/lib/python3.10/site-packages/umap/umap_.py:2826, in UMAP._fit_embed_data(self, X, n_epochs, init, random_state)
2822 def _fit_embed_data(self, X, n_epochs, init, random_state):
2823 """A method wrapper for simplicial_set_embedding that can be
2824 replaced by subclasses.
2825 """
-> 2826 return simplicial_set_embedding(
2827 X,
2828 self.graph_,
2829 self.n_components,
2830 self._initial_alpha,
2831 self._a,
2832 self._b,
2833 self.repulsion_strength,
2834 self.negative_sample_rate,
2835 n_epochs,
2836 init,
2837 random_state,
2838 self._input_distance_func,
2839 self._metric_kwds,
2840 self.densmap,
2841 self._densmap_kwds,
2842 self.output_dens,
2843 self._output_distance_func,
2844 self._output_metric_kwds,
2845 self.output_metric in ("euclidean", "l2"),
2846 self.random_state is None,
2847 self.verbose,
2848 tqdm_kwds=self.tqdm_kwds,
2849 )
File ***/venv/lib/python3.10/site-packages/umap/umap_.py:1106, in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, densmap, densmap_kwds, output_dens, output_metric, output_metric_kwds, euclidean_output, parallel, verbose, tqdm_kwds)
1102 embedding = noisy_scale_coords(
1103 embedding, random_state, max_coord=10, noise=0.0001
1104 )
1105 elif isinstance(init, str) and init == "spectral":
-> 1106 embedding = spectral_layout(
1107 data,
1108 graph,
1109 n_components,
1110 random_state,
1111 metric=metric,
1112 metric_kwds=metric_kwds,
1113 )
1114 # We add a little noise to avoid local minima for optimization to come
1115 embedding = noisy_scale_coords(
1116 embedding, random_state, max_coord=10, noise=0.0001
1117 )
File ***/venv/lib/python3.10/site-packages/umap/spectral.py:304, in spectral_layout(data, graph, dim, random_state, metric, metric_kwds, tol, maxiter)
263 def spectral_layout(
264 data,
265 graph,
(...)
271 maxiter=0
272 ):
273 """
274 Given a graph compute the spectral embedding of the graph. This is
275 simply the eigenvectors of the laplacian of the graph. Here we use the
(...)
302 The spectral embedding of the graph.
303 """
--> 304 return _spectral_layout(
305 data=data,
306 graph=graph,
307 dim=dim,
308 random_state=random_state,
309 metric=metric,
310 metric_kwds=metric_kwds,
311 init="random",
312 tol=tol,
313 maxiter=maxiter
314 )
File ***/venv/lib/python3.10/site-packages/umap/spectral.py:521, in _spectral_layout(data, graph, dim, random_state, metric, metric_kwds, init, method, tol, maxiter)
518 X[:, 0] = sqrt_deg / np.linalg.norm(sqrt_deg)
520 if method == "eigsh":
--> 521 eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
522 L,
523 k,
524 which="SM",
525 ncv=num_lanczos_vectors,
526 tol=tol or 1e-4,
527 v0=np.ones(L.shape[0]),
528 maxiter=maxiter or graph.shape[0] * 5,
529 )
530 elif method == "lobpcg":
531 with warnings.catch_warnings():
File ***/venv/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1605, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
1600 warnings.warn("k >= N for N * N square matrix. "
1601 "Attempting to use scipy.linalg.eigh instead.",
1602 RuntimeWarning)
1604 if issparse(A):
-> 1605 raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
1606 "k >= N. Use scipy.linalg.eigh(A.toarray()) or"
1607 " reduce k.")
1608 if isinstance(A, LinearOperator):
1609 raise TypeError("Cannot use scipy.linalg.eigh for LinearOperator "
1610 "A with k >= N.")
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
@ndettmer This might relate to the number of topics that you have. If there are only a few, for example fewer than 10, it might throw an error because UMAP has trouble reducing the dimensionality of such a small dataset.
@MaartenGr thank you! I actually just took a subset of 20newsgroups. Therefore, the resulting number of topics was quite low. With a larger subset it worked.
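For reference, here is a minimal sketch of the failure mode discussed above, assuming only NumPy and SciPy. The trace shows UMAP's spectral layout calling `scipy.sparse.linalg.eigsh(L, k, ...)` on a sparse graph Laplacian, and SciPy raises this `TypeError` whenever `k >= N`, where `N` is the number of rows (here presumably one per topic). The helper names and the toy path graph are illustrative only and are not part of UMAP or BERTopic.

```python
import warnings

import numpy as np
import scipy.linalg
import scipy.sparse
import scipy.sparse.linalg


def path_graph_laplacian(n_points):
    """Sparse Laplacian of a toy path graph (node i linked to node i + 1)."""
    off_diagonal = np.ones(n_points - 1)
    adjacency = scipy.sparse.diags([off_diagonal, off_diagonal], [-1, 1], format="csr")
    degrees = np.asarray(adjacency.sum(axis=1)).ravel()
    return (scipy.sparse.diags([degrees], [0]) - adjacency).tocsr()


def sparse_eigsh_error(laplacian, k):
    """Return SciPy's error message if eigsh rejects (laplacian, k), else None."""
    try:
        with warnings.catch_warnings():
            # eigsh also emits a RuntimeWarning before raising; silence it here.
            warnings.simplefilter("ignore", RuntimeWarning)
            scipy.sparse.linalg.eigsh(laplacian, k, which="SM")
    except TypeError as err:
        return str(err)
    return None


def dense_smallest_eigenpairs(laplacian, k):
    """Dense fallback named in the error message: eigh on A.toarray()."""
    values, vectors = scipy.linalg.eigh(laplacian.toarray())
    # eigh returns eigenvalues in ascending order, so the first k are the smallest.
    return values[:k], vectors[:, :k]


laplacian = path_graph_laplacian(3)             # N = 3 points
print(sparse_eigsh_error(laplacian, k=3))       # k >= N, so SciPy refuses
print(dense_smallest_eigenpairs(laplacian, k=3)[0])
```

The dense fallback only shows that the matrix itself is fine. It does not change what UMAP does internally, so with very few topics the practical options remain the ones discussed in this thread.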
@MaartenGr So in cases where the number of topics is small, how would you suggest calculating the x,y representation of the topics for visualisation purposes?
@gnanukoth I believe it is possible by lowering the `n_neighbors` parameter when using UMAP, which should solve the issue. I remember there being another issue discussing this with a potential solution, but you would have to search the issues page for `.visualize_topics`.
Noted @MaartenGr , I will search through the issues, Thanks!
Leaving the possible solution found in another issue here: link, for easy reference. Thanks!
Hello @MaartenGr, the proposed solution in the other issue also works only when the number of topics is >= 4. I tried using PCA for cases where the number of topics is < 4, and PCA seems to work with the plotly visualization. Do you think using PCA would be meaningful in such situations? Is there anything else I should consider when using PCA for this visualization?
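The PCA fallback mentioned above could look roughly like this. It is a sketch that uses a plain NumPy SVD rather than any BERTopic API. `pca_2d` and the random stand-in embeddings are hypothetical, and whether a 2D layout of so few topics is meaningful is a separate question.

```python
import numpy as np


def pca_2d(embeddings):
    """Project rows of `embeddings` onto their first two principal axes."""
    data = np.asarray(embeddings, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of `axes` are the principal directions of the centered data.
    _, _, axes = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ axes[:2].T
    # With a single point there is only one axis; pad so the result is always 2D.
    if coords.shape[1] < 2:
        coords = np.pad(coords, ((0, 0), (0, 2 - coords.shape[1])))
    return coords


# Stand-in for three topic embeddings of dimension 8 (not real BERTopic output).
rng = np.random.default_rng(42)
topic_embeddings = rng.normal(size=(3, 8))
xy = pca_2d(topic_embeddings)
print(xy.shape)
```

The resulting `xy` array has one row of x,y coordinates per topic, which is the shape a scatter plot needs.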
@gnanukoth In all honesty, these dimensionality reduction algorithms generally tend to work better the more data is available, since they are trained. So if you have 4 or fewer data points, consider whether it is actually meaningful to perform the reduction at all. I wonder how well such an algorithm can perform on such a small dataset.
Hi!
When I try to run bertopic() I get the following error:
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.
I increased the number of documents to 205504, which should be enough I think.
Does someone have any idea what could cause the problem?