AlexanderFabisch / gmr

Gaussian Mixture Regression
https://alexanderfabisch.github.io/gmr/
BSD 3-Clause "New" or "Revised" License
168 stars 49 forks source link

[MRG] Fix numerical issue in expectation step #31

Closed AlexanderFabisch closed 3 years ago

AlexanderFabisch commented 3 years ago

Fixes #30

AlexanderFabisch commented 3 years ago

@mralbu If you have no time for this, it is OK, but I'd like to ask you one last time whether you can take a look at these fixes. With the test code from #30 it looks quite good now. With random initialization I get scores similar to sklearn. sklearn is also able to initialize from KMeans, which gives slightly better results than kmeans++ initialization, but it is out of scope for this library.

mralbu commented 3 years ago

Using test code from #30, and the experimental GMMRegressor with sklearn internals.

Results look good, scores from random init are practically the same! Did some timings as well, and sklearn internals are quite a bit faster. In particular, kmeans initialization is faster than random init for the experimental GMMRegressor, but not for gmr.

import numpy as np
from sklearn.datasets import load_boston

X, y = load_boston(return_X_y=True)

np.set_printoptions(precision=2, suppress=True)

np.random.seed(2)

start = time.time()
scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressor(
        n_components=2, init_params="kmeans++")
    gmr.fit(X, y)
    score = gmr.score(X, y)
    scores.append(score)
print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.81 0.83 0.77 0.82 0.83 0.83 0.77 0.83 0.82 0.83] in 3.108 seconds

np.random.seed(2)

start = time.time()
scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressorSklearn(
        n_components=2, init_params="random")
    gmr.fit(X, y)
    score = gmr.score(X, y)
    scores.append(score)
print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.82 0.81 0.82 0.82 0.82 0.81 0.85 0.83 0.83 0.78] in 2.877 seconds

from sklego.mixture import GMMRegressor

np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params="random")
    gmr.fit(X, y)
    score = gmr.score(X, y)
    scores.append(score)

print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.82 0.81 0.82 0.82 0.82 0.82 0.85 0.83 0.83 0.78] in 0.091 seconds 

np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params="kmeans")
    gmr.fit(X, y)
    score = gmr.score(X, y)
    scores.append(score)

print(str(np.array(scores)) + f" in {(time.time() - start):.3f} seconds")
>> [0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85 0.85] in 0.042 seconds
AlexanderFabisch commented 3 years ago

Did some timings as well, and sklearn internals are quite a bit faster. In particular, kmeans initialization is faster than random init for the experimental GMMRegressor, but not for gmr.

Thanks for checking.

I think I don't understand this:

  1. GaussianMixtureRegressor + kmeans++: 3.1 s
  2. GaussianMixtureRegressorSklearn + random: 2.9 s
  3. GMMRegressor + random: 0.09 s
  4. GMMRegressor + kmeans: 0.04 s

Number 2 and 3 should give very similar results, shouldn't they?

mralbu commented 3 years ago

That's true! I think the bulk of the difference is in the prediction step. Reran the benchmarks commenting out the prediction, and the timings for 2 and 3 are very similar:

np.random.seed(2)

start = time.time()
scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressor(
        n_components=2, init_params="kmeans++")
    gmr.fit(X, y)

print(f" in {(time.time() - start):.3f} seconds")
>> in 0.163 seconds

np.random.seed(2)

start = time.time()
scores = []
for _ in range(10):
    gmr = GaussianMixtureRegressorSklearn(
        n_components=2, init_params="random")
    gmr.fit(X, y)

print(f" in {(time.time() - start):.3f} seconds")
>> in 0.090 seconds 

from sklego.mixture import GMMRegressor

np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params="random")
    gmr.fit(X, y)

print(f" in {(time.time() - start):.3f} seconds")
>> in 0.082 seconds

np.random.seed(2)
start = time.time()
scores = []
for _ in range(10):
    gmr = GMMRegressor(n_components=2, init_params="kmeans")
    gmr.fit(X, y)

print(f" in {(time.time() - start):.3f} seconds")
>> in 0.034 seconds
AlexanderFabisch commented 3 years ago

OK, thanks! I will merge it then soon.

AlexanderFabisch commented 3 years ago

I'll probably also include a prediction function similar to yours in the future as it is so much faster (if that is OK for you and licenses are compatible). :)

Ticket #32

mralbu commented 3 years ago

Sure, that's ok! You can use any of the code in GMMRegressor, even all of it if you think it would be a good adition. I thought it would be interesting to be able to use other sklearn.mixture.GaussianMixture options, such as different covariance_types and the BayesianGaussianMixture.

AlexanderFabisch commented 3 years ago

Thanks again @mralbu