BCG-X-Official / sklearndf

DataFrame support for scikit-learn.
https://bcg-x-official.github.io/sklearndf/
Apache License 2.0

API: add support for sklearn clustering algorithms #192

Closed (mtsokol closed 2 years ago)

mtsokol commented 2 years ago

Hi @j-ittner @joerg-schneider!

Thank you for your guidance!

Here's a work-in-progress PR that adds support for clusterers, modeled on the existing implementation. Please let me know whether this is the right direction.

All clusterers that implement ClusterMixin are supported. Tomorrow I plan to add separate internal clustering wrappers for the two KMeans methods so that cluster_centers_ is also wrapped in a DataFrame.
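
As a rough illustration of what wrapping cluster_centers_ in a DataFrame could look like, here is a minimal sketch using only scikit-learn and pandas. It is not the sklearndf implementation: the helper name `cluster_centers_frame` and the `cluster` index name are hypothetical, chosen only for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans


def cluster_centers_frame(fitted_kmeans, feature_names):
    """Hypothetical helper: return cluster_centers_ as a DataFrame.

    Rows are clusters, columns are the features used for fitting.
    """
    centers = fitted_kmeans.cluster_centers_
    return pd.DataFrame(
        centers,
        columns=list(feature_names),
        index=pd.RangeIndex(centers.shape[0], name="cluster"),
    )


X = pd.DataFrame(
    [[1, 1], [1, 4], [1, 0], [10, 3], [10, 4], [10, 0]],
    columns=["bmi", "age"],
    index=list("abcdef"),
)

# n_init is set explicitly so the behaviour does not depend on the
# default of the installed scikit-learn version.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
centers_df = cluster_centers_frame(kmeans, X.columns)
print(centers_df)
```

The point of the sketch is only that the centers get the feature names as columns; how the wrapper tracks those names internally is left open here.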

I also have a few questions about this design. One is that having ClustererDF extend LearnerDF imposes the presence of a score method, which is available in e.g. KMeans but not in AgglomerativeClustering. An alternative would be to remove score from LearnerDF and introduce LearnerWithScoringDF in the inheritance hierarchy, but for this first proof of concept I wanted to change as little of the existing codebase as possible and wait for your review.
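
The mismatch can be checked directly against the native scikit-learn classes. The sketch below only inspects which methods each estimator class exposes; the `support` dictionary is a name introduced for this example.

```python
from sklearn.base import ClusterMixin
from sklearn.cluster import AgglomerativeClustering, KMeans

# Record, per clusterer class, whether it exposes score and predict.
support = {}
for cls in (KMeans, AgglomerativeClustering):
    # Both classes are clusterers in the scikit-learn sense.
    assert issubclass(cls, ClusterMixin)
    support[cls.__name__] = {
        "score": hasattr(cls, "score"),
        "predict": hasattr(cls, "predict"),
    }

for name, methods in support.items():
    print(name, methods)
```

A base class that requires score therefore has to decide what to do for clusterers such as AgglomerativeClustering, which is the design question raised above.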

Here's a short interaction with this implementation: labels_ preserves the index of the DataFrame used for fitting, and predict likewise preserves the index of its input:

>>> import sklearn
>>> from sklearn.cluster import KMeans
>>> from sklearndf.clustering import KMeansDF
>>> import numpy as np
>>> import pandas as pd
>>>
>>> data = np.array([[1, 1], [1, 4], [1, 0], [10, 3], [10, 4], [10, 0]])
>>> columns = ['bmi', 'age']
>>> index = ['a', 'b', 'c', 'd', 'e', 'f']
>>> X = pd.DataFrame(data=data, columns=columns, index=index)
>>> X_test = pd.DataFrame(data=[[0, 0], [5, 5]], columns=columns, index=['t', 's'])
>>>
>>> kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
>>> kmeans.labels_
array([1, 1, 1, 0, 0, 0], dtype=int32)
>>> kmeans.predict(X_test)
array([1, 1], dtype=int32)
>>>
>>> kmeansdf = KMeansDF(n_clusters=2, random_state=42).fit(X)
>>> kmeansdf.labels_
a    1
b    1
c    1
d    0
e    0
f    0
Name: labels, dtype: int32
>>> kmeansdf.predict(X_test)
t    1
s    1
Name: prediction, dtype: int32
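
For comparison, the same index-preserving output can be approximated by hand with plain scikit-learn and pandas. This is a sketch of the idea, not the wrapper code from the PR; the Series names simply mirror the ones shown in the session above.

```python
import pandas as pd
from sklearn.cluster import KMeans

columns = ["bmi", "age"]
X = pd.DataFrame(
    [[1, 1], [1, 4], [1, 0], [10, 3], [10, 4], [10, 0]],
    columns=columns,
    index=list("abcdef"),
)
X_test = pd.DataFrame([[0, 0], [5, 5]], columns=columns, index=["t", "s"])

# n_init is set explicitly so the result does not depend on the
# default of the installed scikit-learn version.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Re-attach the index of the fitting data to the fitted labels.
labels = pd.Series(kmeans.labels_, index=X.index, name="labels")

# Re-attach the index of the prediction input to the predictions.
prediction = pd.Series(
    kmeans.predict(X_test), index=X_test.index, name="prediction"
)

print(labels)
print(prediction)
```

Which integer is assigned to which cluster depends on the initialization, so only the grouping, not the specific label values, should be relied on.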

mtsokol commented 2 years ago

Hi @j-ittner, I rebased it onto the latest 2.0.x and the inheritance is now fixed. The PR is ready for another round of review!

mtsokol commented 2 years ago

Hi @j-ittner, I've made all of the changes from your last review, so it's ready for another one.