Closed: AlexanderFabisch closed this 3 years ago
@mralbu If you have no time for this, it is OK, but I'd like to ask you one last time whether you can take a look at these fixes. With the test code from #30 it looks quite good now. With random initialization I get scores similar to sklearn. sklearn is also able to initialize from KMeans, which gives slightly better results than kmeans++ initialization, but it is out of scope for this library.
I used the test code from #30 and the experimental GMMRegressor built on sklearn internals.
Results look good; the scores from random init are practically the same! I did some timings as well, and the sklearn internals are quite a bit faster. In particular, kmeans initialization is faster than random init for the experimental GMMRegressor, but not for gmr.
import time

import numpy as np
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)
np.set_printoptions(precision=2, suppress=True)
np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
gmr = GaussianMixtureRegressor(
n_components=2, init_params="kmeans++")
gmr.fit(X, y)
score = gmr.score(X, y)
scores.append(score)
print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.81 0.83 0.77 0.82 0.83 0.83 0.77 0.83 0.82 0.83] in 3.108 seconds
np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
gmr = GaussianMixtureRegressorSklearn(
n_components=2, init_params="random")
gmr.fit(X, y)
score = gmr.score(X, y)
scores.append(score)
print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.82 0.81 0.82 0.82 0.82 0.81 0.85 0.83 0.83 0.78] in 2.877 seconds
from sklego.mixture import GMMRegressor
np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
gmr = GMMRegressor(n_components=2, init_params="random")
gmr.fit(X, y)
score = gmr.score(X, y)
scores.append(score)
print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.82 0.81 0.82 0.82 0.82 0.82 0.85 0.83 0.83 0.78] in 0.091 seconds
np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
gmr = GMMRegressor(n_components=2, init_params="kmeans")
gmr.fit(X, y)
score = gmr.score(X, y)
scores.append(score)
print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85] in 0.042 seconds
Thanks for checking.
I think I don't understand this:
Numbers 2 and 3 should give very similar results, shouldn't they?
That's true! I think the bulk of the difference is in the prediction step. I reran the benchmarks with the prediction step commented out, and the timings for 2 and 3 are very similar:
np.random.seed(2)
start = time.time()
for _ in range(10):
gmr = GaussianMixtureRegressor(
n_components=2, init_params="kmeans++")
gmr.fit(X, y)
print(f" in {(time.time() - start):.3f} seconds")
>> in 0.163 seconds
np.random.seed(2)
start = time.time()
for _ in range(10):
gmr = GaussianMixtureRegressorSklearn(
n_components=2, init_params="random")
gmr.fit(X, y)
print(f" in {(time.time() - start):.3f} seconds")
>> in 0.090 seconds
from sklego.mixture import GMMRegressor
np.random.seed(2)
start = time.time()
for _ in range(10):
gmr = GMMRegressor(n_components=2, init_params="random")
gmr.fit(X, y)
print(f" in {(time.time() - start):.3f} seconds")
>> in 0.082 seconds
np.random.seed(2)
start = time.time()
for _ in range(10):
gmr = GMMRegressor(n_components=2, init_params="kmeans")
gmr.fit(X, y)
print(f" in {(time.time() - start):.3f} seconds")
>> in 0.034 seconds
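As an aside, single time.time() deltas around a loop are fairly noisy; timeit.repeat is a more robust way to time such fits. A minimal sketch with a stand-in workload (not the actual regressor):

```python
import timeit

import numpy as np

def fit_once():
    # Stand-in workload; in the benchmarks above this would be gmr.fit(X, y).
    np.linalg.svd(np.random.RandomState(0).randn(100, 100))

# Run the 10-fit loop three times and report the best, which is less
# sensitive to scheduler and cache noise than a single measurement.
times = timeit.repeat(fit_once, number=10, repeat=3)
print(f"best of 3 runs: {min(times):.3f} s for 10 fits")
```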
OK, thanks! Then I will merge it soon.
I'll probably also include a prediction function similar to yours in the future as it is so much faster (if that is OK for you and licenses are compatible). :)
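For reference, the standard GMM conditional-mean prediction that such a function implements can be sketched in plain NumPy. This is my own illustration of the textbook formula, not the code of either library, and all names are made up:

```python
import numpy as np

def gmm_predict(X, means, covariances, weights):
    """Predict E[y | x] from a GMM fitted on the joint (X, y) space.

    means: (k, d + 1) component means, last column is the y dimension.
    covariances: (k, d + 1, d + 1) full covariance matrices.
    weights: (k,) mixture weights.
    """
    k = means.shape[0]
    d = means.shape[1] - 1
    n = X.shape[0]
    log_resp = np.empty((n, k))
    cond_means = np.empty((n, k))
    for j in range(k):
        mu_x, mu_y = means[j, :d], means[j, d]
        S_xx = covariances[j, :d, :d]
        S_yx = covariances[j, d, :d]
        diff = X - mu_x
        sol = np.linalg.solve(S_xx, diff.T).T  # S_xx^{-1} (x - mu_x), row-wise
        # Log of weights[j] * N(x | mu_x, S_xx), dropping the additive
        # constant shared by all components (it cancels when normalizing).
        _, logdet = np.linalg.slogdet(S_xx)
        log_resp[:, j] = (np.log(weights[j]) - 0.5 * logdet
                          - 0.5 * np.einsum("nd,nd->n", diff, sol))
        # Conditional mean: E[y | x, j] = mu_y + S_yx S_xx^{-1} (x - mu_x)
        cond_means[:, j] = mu_y + sol @ S_yx
    # Normalize responsibilities in log space for numerical stability.
    log_resp -= log_resp.max(axis=1, keepdims=True)
    resp = np.exp(log_resp)
    resp /= resp.sum(axis=1, keepdims=True)
    # Prediction is the responsibility-weighted mixture of conditional means.
    return np.einsum("nk,nk->n", resp, cond_means)
```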
Ticket #32
Sure, that's OK! You can use any of the code in GMMRegressor, even all of it if you think it would be a good addition. I thought it would be interesting to be able to use other sklearn.mixture.GaussianMixture options, such as different covariance_type settings and BayesianGaussianMixture.
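For what it's worth, a hedged sketch of what that could look like, fitting the joint density with different covariance_type settings and with BayesianGaussianMixture (made-up toy data; note that the conditional-mean formula needs the x-y cross-covariances, so only "full" and "tied" are directly usable for regression):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)
Xy = np.hstack([X, y[:, None]])  # model the joint density p(x, y)

# "diag" and "spherical" zero the x-y cross terms, so for regression only
# "full" and "tied" keep the information the conditional mean needs.
for cov_type in ["full", "tied"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov_type,
                         random_state=0).fit(Xy)
    print(cov_type, gm.lower_bound_)

# A Dirichlet-process prior can prune superfluous components automatically.
bgm = BayesianGaussianMixture(
    n_components=5, weight_concentration_prior_type="dirichlet_process",
    random_state=0).fit(Xy)
print(np.round(bgm.weights_, 2))
```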
Thanks again @mralbu
Fixes #30