How to cluster my own Dataset

hammad2008 commented 1 year ago

@DeMoriarty I hope you are doing well. I have the following dataset with four features, and I want to cluster the data.

Can you please provide code for this? I need to run it in Anaconda.

I will be very thankful to you.

Dataset.csv

hammad2008 commented 1 year ago

@TheProjectsGuy @DeMoriarty Can you please help me with that?

hammad2008 commented 1 year ago

@DeMoriarty Can you please help me with this?

DeMoriarty commented 1 year ago

You can read and parse your file with the csv package, represent your data as a nested list of floats, and convert it into a PyTorch tensor with torch.tensor(nested_list, device=device). Then fit the KMeans clusterer on this tensor. You may also need to normalize or standardize your columns so that all four features have similar ranges; this usually gives better clustering results. The exact ways of reading CSV files or preprocessing features are beyond the scope of this repository, but there are many useful resources online.
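
As a rough illustration of the steps described above, the following sketch parses CSV text with the csv module into a nested list of floats and standardizes each column to zero mean and unit variance. The helper names load_rows and standardize and the inline sample data are made up for this example; it assumes a file with no header row and purely numeric columns.

```python
import csv
import io
import statistics


def load_rows(file_obj):
    """Parse a header-less, all-numeric CSV into a nested list of floats."""
    return [[float(value) for value in row] for row in csv.reader(file_obj) if row]


def standardize(rows):
    """Rescale every column to zero mean and unit (population) standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is 0.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mean) / std for value, mean, std in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical stand-in for open("Dataset.csv", newline="").
sample = io.StringIO("1.0,200.0,0.5,10.0\n2.0,400.0,0.7,30.0\n3.0,600.0,0.9,50.0\n")
nested_list = standardize(load_rows(sample))
print(nested_list[0])
```

The resulting nested_list can then be passed to torch.tensor(nested_list, device=device) as described above.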

hammad2008 commented 1 year ago

@DeMoriarty, I just want to use two features, the 2nd and 4th columns, and then apply this K-means repo. Can you please modify the code for that? I will be very thankful. I have tried many methods but still do not get a result; I always get errors.

DeMoriarty commented 1 year ago

There are many ways of choosing specific columns. If your data is already a numpy.ndarray or a torch.Tensor, you can do:

indices = [1, 3]
data = data[:, indices]

which will select the columns with specified indices.
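
A small sketch of this indexing on a toy NumPy array may help; the 3x4 array is invented and only stands in for a four-feature dataset. It also shows how a list index differs from a single integer index in the shape it returns.

```python
import numpy as np

# Toy 3x4 array standing in for the four-feature dataset.
data = np.arange(12, dtype=np.float64).reshape(3, 4)

indices = [1, 3]              # zero-based: 2nd and 4th columns
selected = data[:, indices]   # a list index keeps a 2-D result

print(selected.shape)         # (3, 2)
print(data[:, 1].shape)       # (3,)   a single integer index drops the axis
print(data[:, [1]].shape)     # (3, 1) a one-element list keeps it
```

The same indexing expression also works on a torch.Tensor.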

hammad2008 commented 1 year ago

But it does not work with your repo. The dataset cannot be clustered using your KMeans function.

DeMoriarty commented 1 year ago

Can you provide a minimal code example that reproduces the error?

hammad2008 commented 1 year ago

@DeMoriarty The code is below.

The error is AttributeError: 'numpy.ndarray' object has no attribute 'device'

import numpy as np
array = np.loadtxt('Dataset.csv', delimiter=',')

twofeaturs = array[:, [1, 3]]
print(twofeaturs)

from fast_pytorch_kmeans import KMeans
import torch

kmeans = KMeans(n_clusters=4, mode='euclidean', verbose=1)

labels = kmeans.fit_predict(twofeaturs)

After solving the error, I want to plot the data colored by cluster.

DeMoriarty commented 1 year ago

fast_pytorch_kmeans only accepts PyTorch tensors as input. All you have to do is:

twofeatures = torch.from_numpy(twofeatures)

If you want to run the k-means algorithm on the GPU, you also need to call:

twofeatures = twofeatures.cuda()

before calling kmeans.fit_predict().

For plotting data with labels, see #7.

hammad2008 commented 1 year ago

@DeMoriarty When I use the following code, it gives the error UnboundLocalError: local variable 'expected' referenced before assignment

from fast_pytorch_kmeans import KMeans
import torch
twofeaturs = torch.from_numpy(twofeaturs)
twofeaturs = twofeaturs.cuda()
kmeans = KMeans(n_clusters=8, mode='euclidean', verbose=1)
labels = kmeans.fit_predict(twofeaturs)

hammad2008 commented 1 year ago

@DeMoriarty Below is the full code, but I still get the following error. Can you please check it?

UnboundLocalError: local variable 'expected' referenced before assignment

from fast_pytorch_kmeans import KMeans
import torch
import numpy as np
import matplotlib.pyplot as plt
array = np.loadtxt('Dataset.csv', delimiter=',')

twofeatures = array[:, [1, 3]]
print(twofeatures)
torchfeatures = torch.from_numpy(twofeatures)
gpufeature = torchfeatures.cuda()
kmeans = KMeans(n_clusters=8, mode='euclidean', verbose=1)
labels = kmeans.fit_predict(gpufeature)
plt.scatter(array[:, 0], array[:, 1], c=labels, s=10, cmap="jet")
centroids = kmeans.centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=70)
plt.show()

DeMoriarty commented 1 year ago

That seems to be a bug in KMeans: it only recognizes float16 or float32 data, and your data is probably float64. As a quick solution, you can convert your data to float32 or float16:

# float32
gpufeature = gpufeature.to(torch.float)
# or float16
gpufeature = gpufeature.to(torch.half)

In the meantime, I will fix this bug and also add some assertions to make sure that user input is a PyTorch tensor. Thanks for your contribution.
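
One way to avoid float64 input in the first place is to cast on the NumPy side, since torch.from_numpy keeps the dtype of the array it wraps. The sketch below uses an in-memory CSV as a stand-in for Dataset.csv; the sample values are invented.

```python
import io

import numpy as np

csv_text = "1.0,2.0,3.0,4.0\n5.0,6.0,7.0,8.0\n"

# np.loadtxt produces float64 unless told otherwise.
array64 = np.loadtxt(io.StringIO(csv_text), delimiter=",")
print(array64.dtype)  # float64

# Option 1: request float32 while loading.
array32 = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.float32)

# Option 2: cast an existing array after selecting columns.
twofeatures = array64[:, [1, 3]].astype(np.float32)
print(array32.dtype, twofeatures.dtype)  # float32 float32
```

Either array can then go through torch.from_numpy and arrive as a float32 tensor.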

hammad2008 commented 1 year ago

@DeMoriarty After that, I am now getting the following error: TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

from fast_pytorch_kmeans import KMeans
import torch
import numpy as np
import matplotlib.pyplot as plt

array = np.loadtxt('Dataset.csv', delimiter=',')

twofeatures = array[:, [1, 3]]
print(twofeatures)
torchfeatures = torch.from_numpy(twofeatures)
gpufeature = torchfeatures.cuda()
gpufeature = gpufeature.to(torch.float)
kmeans = KMeans(n_clusters=8, mode='euclidean', verbose=1)
labels = kmeans.fit_predict(gpufeature)
plt.scatter(array[:, 0], array[:, 1], c=labels, s=10, cmap="jet")
centroids = kmeans.centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=70)
plt.show()

hammad2008 commented 1 year ago

@DeMoriarty Now I am getting this error: ValueError: 'c' argument has 1 elements, which is inconsistent with 'x' and 'y' with size 6610.

from fast_pytorch_kmeans import KMeans
import torch
import numpy as np
import matplotlib.pyplot as plt

array = np.loadtxt('Dataset.csv', delimiter=',')

twofeatures = array[:, [1, 3]]
print(twofeatures)
torchfeatures = torch.from_numpy(twofeatures)
gpufeature = torchfeatures.cuda()
gpufeature = gpufeature.to(torch.float)
kmeans = KMeans(n_clusters=4, mode='euclidean', verbose=1)
labels = kmeans.fit_predict(gpufeature)
index = labels.cpu().data.numpy().argmax()
print(index)
plt.scatter(array[:, [0]], array[:, [1]], c=index, s=10, cmap="jet")
centroids = kmeans.centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=70)
plt.show()

DeMoriarty commented 1 year ago

does this work?

plt.scatter(twofeatures[:, 0], twofeatures[:, 1], c=labels.cpu(), s=10, cmap="jet")

hammad2008 commented 1 year ago

@DeMoriarty no it does not work

DeMoriarty commented 1 year ago

@hammad2008 what is the error message?

hammad2008 commented 1 year ago

@DeMoriarty The following is my full code. The error is

ValueError: 'c' argument has 1 elements, which is inconsistent with 'x' and 'y' with size 6610.

from fast_pytorch_kmeans import KMeans
import torch
import numpy as np
import matplotlib.pyplot as plt

array = np.loadtxt('Dataset.csv', delimiter=',')

twofeatures = array[:, [1, 3]]
print(twofeatures)
torchfeatures = torch.from_numpy(twofeatures)
gpufeature = torchfeatures.cuda()
gpufeature = gpufeature.to(torch.float)
kmeans = KMeans(n_clusters=4, mode='euclidean', verbose=1)
labels = kmeans.fit_predict(gpufeature)
print(labels)
index = labels.cpu().data.numpy().argmax()
print(index)
plt.scatter(array[:, [0]], array[:, [1]], c=index, s=10, cmap="jet")
centroids = kmeans.centroids
plt.scatter(centroids[:, 0], centroids[:, 1], s=70)
plt.show()

DeMoriarty commented 1 year ago

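It's giving that error because you're taking the argmax of the labels. argmax() returns a single integer (the position of the largest label), so c receives 1 element instead of one label per point (6610).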
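
To make the shape mismatch concrete, here is a small NumPy sketch with made-up labels for six points (the real run had 6610):

```python
import numpy as np

# Made-up cluster labels for six points, as fit_predict might return them.
labels = np.array([0, 2, 1, 3, 1, 0])

index = labels.argmax()
print(index)           # 3: the position of the largest label, a single number
print(np.ndim(index))  # 0, so it cannot color six points individually
print(labels.shape)    # (6,), one label per point, which is what `c` needs
```

Passing the full labels array as c, rather than index, gives matplotlib one color value per point.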

hammad2008 commented 1 year ago

@DeMoriarty Can you please modify it and run it?

DeMoriarty commented 1 year ago

have you tried replacing

plt.scatter(array[:, [0]], array[:, [1]], c=index, s=10, cmap="jet")

with

plt.scatter(twofeatures[:, 0], twofeatures[:, 1], c=labels.cpu(), s=10, cmap="jet")

hammad2008 commented 1 year ago

@DeMoriarty It also does not work.

DeMoriarty commented 1 year ago

@hammad2008 and the error message is?

hammad2008 commented 1 year ago

@DeMoriarty AttributeError: 'numpy.ndarray' object has no attribute 'cpu'

DeMoriarty commented 1 year ago

@hammad2008 then just remove .cpu()

DeMoriarty commented 1 year ago

plt.scatter(twofeatures[:, 0], twofeatures[:, 1], c=labels, s=10, cmap="jet")
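
The two plotting errors in this thread ("can't convert cuda:0 device type tensor to numpy" and "'numpy.ndarray' object has no attribute 'cpu'") both come down to whether labels is a tensor or an array at the point of plotting. One possible way to sidestep that is a small conversion helper. The name to_numpy is hypothetical and not part of fast_pytorch_kmeans; the sketch assumes that anything with a detach method behaves like a torch.Tensor.

```python
import numpy as np


def to_numpy(x):
    """Return x as a NumPy array, whether it is a tensor, an array, or a list."""
    if hasattr(x, "detach"):
        # Looks like a torch.Tensor: drop autograd tracking, copy to host memory.
        return x.detach().cpu().numpy()
    return np.asarray(x)


# Arrays and lists work even without PyTorch installed.
print(to_numpy(np.array([0, 1, 1])))
print(to_numpy([2, 0, 3]).shape)  # (3,)
```

With it, the plotting call could be written as plt.scatter(twofeatures[:, 0], twofeatures[:, 1], c=to_numpy(labels), s=10, cmap="jet"), and the centroids could be passed through to_numpy in the same way.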

DeMoriarty commented 1 year ago

closing since no further issues reported

DeMoriarty / fast_pytorch_kmeans

How to cluster my own Dataset #9