Open chapmanjacobd opened 1 year ago
Hi,
Could you please provide information on your OS, scikit-learn version, and numpy version? Also, can you try with a non-sparse matrix?
Hmm, I'm not sure how to try a non-sparse matrix. I'll look into it.
cat /etc/os-release
NAME="Fedora Linux"
VERSION="38 (KDE Plasma)"
...
pip freeze | grep -iE '^sci|^num'
numpy==1.23.4
scikit-learn==1.2.2
scipy==1.9.3
$ git clone https://github.com/chapmanjacobd/library
xklb/utils.py
────────────────────────────────────────
────────────────────────────────────────┐
1105: def load_spacy_model(model=None): │
────────────────────────────────────────┘
def cluster_paths(paths, model=None, n_clusters=None):
nlp = load_spacy_model(model)
- from sklearn.cluster import KMeans
+ from pdc_dp_means import DPMeans
from sklearn.feature_extraction.text import TfidfVectorizer
sentence_strings = (path_to_sentence(s) for s in paths)
─────────────────────────────────────────────────────────────┐
1117: def cluster_paths(paths, model=None, n_clusters=None): │
─────────────────────────────────────────────────────────────┘
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(joined_strings)
- kmeans = KMeans(n_clusters=n_clusters or int(X.shape[0] ** 0.5), random_state=0).fit(X)
+ kmeans = DPMeans(n_clusters=n_clusters or int(X.shape[0] ** 0.5), random_state=0).fit(X)
clusters = kmeans.labels_
$ ipython --pdb -m xklb.lb cs ~/mc/tabs
Thanks, env shouldn't be an issue then.
Could you try the toy example:
from sklearn.datasets import make_blobs
from pdc_dp_means import DPMeans
# Generate sample data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Apply DPMeans clustering
dpmeans = DPMeans(n_clusters=1, n_init=10, delta=10)  # n_init and delta parameters
dpmeans.fit(X)
# Predict the cluster for each data point
y_dpmeans = dpmeans.predict(X)
# Plotting clusters and centroids
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=y_dpmeans, s=50, cmap='viridis')
centers = dpmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
plt.show()
This example uses a non-sparse (dense) matrix. I suspect that the sparse input is the problem.
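To try the non-sparse route with the TF-IDF features from the diff above, one option is to convert the vectorizer output to a dense array before fitting. The following is a minimal sketch, not a confirmed fix: `tfidf_dense` and the sample sentences are hypothetical names and data made up for illustration, and whether DPMeans accepts sparse input at all is exactly what this is meant to test.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_dense(sentences):
    """Vectorize sentences with TF-IDF and return a dense NumPy array.

    Hypothetical helper: TfidfVectorizer.fit_transform returns a SciPy
    sparse matrix, so it is converted with .toarray() here.
    """
    X = TfidfVectorizer().fit_transform(sentences)
    return X.toarray() if sparse.issparse(X) else X


# Made-up sample data standing in for the path-derived sentences.
sentences = [
    "cats videos funny",
    "funny cats compilation",
    "rust programming tutorial",
    "python programming tutorial",
]
X_dense = tfidf_dense(sentences)
print(type(X_dense), X_dense.shape)

# In cluster_paths, the dense array would then replace the sparse X
# (left commented out because it requires pdc_dp_means to be installed):
# from pdc_dp_means import DPMeans
# kmeans = DPMeans(n_clusters=n_clusters or int(X_dense.shape[0] ** 0.5), random_state=0).fit(X_dense)
```

Note that a dense array needs memory proportional to the number of documents times the vocabulary size, so this is only practical for testing on a modest number of paths.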
I tried to use this as a drop-in replacement for KMeans but I get an error:
Here is my code: