AlexanderFabisch / gmr

Gaussian Mixture Regression
https://alexanderfabisch.github.io/gmr/
BSD 3-Clause "New" or "Revised" License

NaN value for gmm.predict? #5

Closed · kroscek closed this issue 7 years ago

kroscek commented 7 years ago

Hi. I use my own dataset with gmr. My training set, named train, has 188318 rows and 14 columns, and my test set has 122000 rows and 14 columns. My label is y_train (188318 rows,). Following the regression example you provide:

gmm.from_samples(train)
Y = gmm.predict(np.array([0]), y_train[:, np.newaxis])

Not sure why it returns NaN values? Usually we fit the model using train and y_train, and then predict on the test data, right?
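For context: the regression example in gmr fits one GMM on the joint array of features and target, then conditions on the feature dimensions. A minimal sketch of that pattern, where train and y_train are the arrays described above and test is a hypothetical held-out feature matrix:

import numpy as np
from gmr import GMM

# Fit the joint distribution p(features, label): each sample is the
# 14 feature values with the label appended as a 15th column.
gmm = GMM(n_components=10, random_state=2016)
gmm.from_samples(np.hstack((train, y_train[:, np.newaxis])))

# Condition on the 14 feature dimensions (indices 0..13) to get the
# expected value of the remaining dimension, the label.
y_pred = gmm.predict(np.arange(train.shape[1]), test)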

AlexanderFabisch commented 7 years ago

Hi,

can you create a minimal example (with a smaller dataset) that reproduces the error? I need to be able to reproduce it on my machine to fix it.

kroscek commented 7 years ago

import numpy as np
from gmr import GMM
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

boston = load_boston()
X, y = boston.data, boston.target
transformed_X = StandardScaler(with_mean=True, with_std=True)
X_train = transformed_X.fit_transform(X)
gmm = GMM(n_components=10, random_state=2016, verbose=1)
gmm.from_samples(X_train)
Y = gmm.predict(np.array([0]), y[:, np.newaxis])

If I'm not wrong, the problem lies in a scaling issue? Using the Boston data, it works after rescaling X (and y, if wanted). And one more question: shouldn't we use the test data for prediction?
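On the scaling point: a scaler fitted on the training data has to be reused, unchanged, on the test data. A sketch of that, assuming the GMM was fitted on the joint array of scaled features and target as in the regression example, and where X_test is a hypothetical held-out feature matrix:

# Transform the test features with the scaler fitted on the training data;
# refitting on the test set would put train and test on different scales.
X_test_scaled = transformed_X.transform(X_test)

# Condition the joint GMM on the scaled feature dimensions (indices 0..12
# for the 13 Boston features) to predict the target for each test row.
Y_pred = gmm.predict(np.arange(X_test_scaled.shape[1]), X_test_scaled)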

AlexanderFabisch commented 7 years ago

A minimal example that reproduces the error would be:

import numpy as np
from gmr import GMM
from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
gmm = GMM(n_components=10, random_state=2016, verbose=1)
gmm.from_samples(X)
Y = gmm.predict(np.array([0]), y[:, np.newaxis])

What you want to predict is your choice; I cannot say what makes sense in your use case. If your use case is to demonstrate the usage of this library, it is totally OK to predict on the test data. :)

AlexanderFabisch commented 7 years ago

Good catch.

For your information, it was a floating point precision problem: the probabilities were almost 0 and could no longer be represented as floats, which resulted in a division by 0. It should be fixed in the latest commit of this repository. Could you verify that?
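For reference, the standard way to avoid this kind of underflow is to normalize in log space using the log-sum-exp (max-shift) trick. A sketch of the idea, not necessarily the exact change in that commit:

import numpy as np

def responsibilities_from_log_probs(log_p):
    # log_p: shape (n_samples, n_components), containing
    # log(prior_k) + log N(x | mu_k, Sigma_k) for each component k.
    # Subtracting the row-wise maximum before exponentiating guarantees
    # one term equals exp(0) = 1, so the denominator never underflows to 0.
    shifted = log_p - log_p.max(axis=1, keepdims=True)
    p = np.exp(shifted)
    return p / p.sum(axis=1, keepdims=True)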