ddbourgin / numpy-ml

Machine learning, in numpy
https://numpy-ml.readthedocs.io/
GNU General Public License v3.0
15.29k stars 3.71k forks

Can I write a K-means model, then pull request? #31

Closed daidai21 closed 5 years ago

daidai21 commented 5 years ago

I can't find a K-means model, so I'd like to code one. Thanks!

ddbourgin commented 5 years ago

Hi @daidai21 - thanks for your interest!

There actually is a k-means model as part of the KNN module, though I haven't explicitly called it that in the READMEs. Specifically, the KNN object takes an argument classifier, which converts between k-nearest neighbors regression (classifier=False) and k-means classification/clustering (classifier=True).
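For intuition about what that flag does, here is a minimal standalone sketch of the idea (this is not the repo's actual code, just an illustration): with classifier=True the k nearest neighbors vote on a label, and with classifier=False their targets are averaged.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3, classifier=True):
    """Toy k-nearest-neighbors predictor illustrating the classifier flag."""
    preds = []
    for x in X_query:
        dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to every training point
        nn = y_train[np.argsort(dists)[:k]]          # labels/targets of the k nearest
        if classifier:
            vals, counts = np.unique(nn, return_counts=True)
            preds.append(vals[np.argmax(counts)])    # majority vote over neighbor labels
        else:
            preds.append(nn.mean())                  # average of neighbor targets (regression)
    return np.array(preds)
```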

Feel free to propose other models you'd be interested in working on, though!

daidai21 commented 5 years ago

Sorry, one more question: is there no SVM? @ddbourgin

ddbourgin commented 5 years ago

@daidai21 - No need to apologize! An SVM implementation would be awesome -- it's been on my TODO list for ages :)

I suspect the crux will be implementing the SMO algorithm properly. If you decide to take it on, I wouldn't worry too much about efficiency - for this repo, the focus is on making everything as clean and clear as possible rather than on being clever.
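For reference, Platt's SMO can be prototyped in a simplified form (random choice of the second index, linear kernel only). This is a sketch of the algorithm under those assumptions, not a full or efficient implementation:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-4, max_passes=10):
    """Simplified SMO for a linear-kernel SVM. Labels y must be in {-1, +1}."""
    n = X.shape[0]
    alpha = np.zeros(n)
    b = 0.0
    K = X @ X.T  # precomputed linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            # optimize over i only if it violates the KKT conditions
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = np.random.choice([m for m in range(n) if m != i])  # random second index
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                # box-constraint clipping bounds for alpha[j]
                if y[i] != y[j]:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                else:
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                if L == H:
                    continue
                eta = 2.0 * K[i, j] - K[i, i] - K[j, j]  # 2nd derivative along the constraint
                if eta >= 0:
                    continue
                alpha[j] = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj_old) < 1e-5:
                    continue
                alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
                # update the intercept from whichever multiplier is interior
                b1 = b - Ei - y[i] * (alpha[i] - ai_old) * K[i, i] \
                     - y[j] * (alpha[j] - aj_old) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai_old) * K[i, j] \
                     - y[j] * (alpha[j] - aj_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2.0
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X  # recover primal weights (valid for the linear kernel only)
    return w, b
```

Prediction is then `np.sign(X @ w + b)`. The full algorithm adds a heuristic for choosing the second index and an error cache, which matter for speed but not for correctness on small problems.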

Also, if you end up referencing other implementations when writing your code, please make sure to cite them in the docstrings and PR. It's important that any code you submit is your own work.

Finally - thanks! Let me know if you have any questions as you go along :)

daidai21 commented 5 years ago

OK, I'll try to implement this algorithm. I'll need some time because I have other work, but I'll finish it as soon as possible.

Nice to meet you. @ddbourgin

ddbourgin commented 5 years ago

Sure, take your time, and let me know if you have any questions!

daidai21 commented 5 years ago

Hi, David

I took some time to finish it, but not all of the tests pass. In roughly 78% of trials my model and sklearn's model produce the same accuracy; in the rest, sometimes my model scores higher and sometimes sklearn's does.

I think the discrepancy is related to the distribution of the randomly generated data, and that my code is OK. What do you think?

Here is the test code:

import warnings
warnings.filterwarnings('ignore')

import random
import numpy as np

from SVM import SVM  # my own model

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def test_SVM():
    i = 1
    np.random.seed(12345)
    while True:
        # generate a random 2-class dataset
        X, Y = make_blobs(
            n_samples=np.random.randint(2, 100),
            n_features=np.random.randint(2, 100),
            centers=2, random_state=i,
        )
        X, X_test, Y, Y_test = train_test_split(X, Y, test_size=0.3, random_state=i)
        if 0 not in Y or 1 not in Y:
            # skip degenerate splits where the training set contains only one class
            continue
        # draw random hyperparameters
        C = random.uniform(0.1, 0.9)
        max_iter = random.randint(50, 500)  # SVC expects an integer here
        kernel = np.random.choice(["linear", "rbf"])
        tol = random.uniform(0.000001, 0.1)
        # fit both models on the same data and compare test accuracy
        clf1 = SVC(C=C, max_iter=max_iter, kernel=kernel, tol=tol)
        clf1.fit(X, Y)
        pred1 = clf1.predict(X_test)
        clf2 = SVM(C=C, max_iter=max_iter, kernel=kernel, tol=tol)
        clf2.fit(X, Y)
        pred2 = clf2.predict(X_test)
        if accuracy_score(Y_test, pred1) == accuracy_score(Y_test, pred2):
            print("PASSED")
        else:
            print("ERROR", accuracy_score(Y_test, pred1), accuracy_score(Y_test, pred2))

if __name__ == "__main__":
    test_SVM()

And the output from a run:

PASSED
PASSED
PASSED
PASSED
ERROR 0.3333333333333333 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.0
PASSED
ERROR 0.3125 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.7692307692307693
PASSED
PASSED
ERROR 1.0 0.5
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 0.9655172413793104 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 0.5 1.0
PASSED
ERROR 1.0 0.5384615384615384
PASSED
ERROR 1.0 0.9090909090909091
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.3333333333333333
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.75
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.9
ERROR 1.0 0.6666666666666666
PASSED
ERROR 0.3333333333333333 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 0.4444444444444444 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 0.14285714285714285 1.0
ERROR 1.0 0.8
PASSED
PASSED
ERROR 1.0 0.9583333333333334
PASSED
ERROR 1.0 0.3333333333333333
PASSED
ERROR 1.0 0.9047619047619048
ERROR 0.0 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 0.2 1.0
PASSED
PASSED
ERROR 1.0 0.6
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.9090909090909091
ERROR 0.0 1.0
ERROR 0.3333333333333333 1.0
ERROR 1.0 0.6
PASSED
PASSED
PASSED
PASSED
ERROR 0.5 1.0
ERROR 1.0 0.8
ERROR 1.0 0.9523809523809523
PASSED
ERROR 0.32 1.0
PASSED
PASSED
ERROR 1.0 0.8333333333333334
ERROR 1.0 0.9259259259259259
ERROR 1.0 0.96
PASSED
PASSED
PASSED
ERROR 1.0 0.9259259259259259
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.5
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.4
PASSED
ERROR 0.4 1.0
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.6666666666666666
PASSED
PASSED
ERROR 0.4166666666666667 1.0
ERROR 1.0 0.9166666666666666
PASSED
ERROR 1.0 0.6666666666666666
PASSED
PASSED
ERROR 1.0 0.6
PASSED
PASSED
ERROR 0.3333333333333333 1.0
PASSED
ERROR 0.4 1.0
ERROR 0.8235294117647058 1.0
PASSED
PASSED
PASSED
PASSED
ERROR 0.5555555555555556 1.0
PASSED
PASSED
ERROR 0.0 1.0
PASSED
PASSED
PASSED
PASSED
ERROR 1.0 0.5
PASSED
PASSED
ddbourgin commented 5 years ago

Hi @daidai21 - thank you for working on this! It's not clear to me why random data generation would result in failed tests, since both models receive the same input data and targets. Perhaps I'm missing something?

Anyway, feel free to submit a PR and we can try to work through the code together to identify what's going on. It's difficult to know right now why certain tests aren't passing, since I don't know what the model code looks like.

Finally, to help track down the cause of the failed tests, I'd recommend directly comparing pred1 and pred2 to ensure that individual data points are being categorized in the same way between the two models. This will help you to better identify why some of the tests are failing :)
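For example (with hypothetical stand-in arrays for the two models' test-set predictions):

```python
import numpy as np

# hypothetical stand-ins for the two models' predictions on the same test set
pred1 = np.array([0, 1, 1, 0, 1])
pred2 = np.array([0, 1, 0, 0, 1])

mismatch = np.flatnonzero(pred1 != pred2)  # indices where the models disagree
print("disagreements at indices:", mismatch)
print("agreement rate:", np.mean(pred1 == pred2))
```

Inspecting the actual data points at the mismatched indices (e.g. their distance from the decision boundary) usually narrows down whether the difference comes from the optimizer, the intercept, or a labeling convention.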

Thanks again!

ddbourgin commented 5 years ago

Closing this, as the code you are talking about is not your own work.

See https://github.com/ddbourgin/numpy-ml/pull/37