Professor-G / MicroLIA

Gravitational microlensing classification engine using machine learning
GNU General Public License v3.0

training_set.create returns ValueError #6

Closed JulienPeloton closed 3 years ago

JulienPeloton commented 3 years ago

Hi - I recently switched to the FINK branch and retrained the models. However, when creating a training dataset, the code crashes with a ValueError. Here is a minimal example (that used to work) to reproduce the bug:

import numpy as np
from LIA import training_set

time = np.array(
    [
        58206.13548, 58344.48149, 58350.46324, 58362.49937, 58365.44464, 58368.49841,
        58371.50412, 58374.44425, 58377.52166, 58380.48466, 58383.48248, 58384.42387,
        58386.49883, 58389.42224, 58397.46396, 58422.42048, 58425.37649, 58428.33483,
        58431.41634, 58434.39728, 58437.42243, 58441.44642, 58462.34087, 58469.1767,
        58472.23556, 58481.26917, 58487.20136, 58491.19377, 58495.23735, 58503.21367,
        58507.19662, 58510.18974, 58514.1699,  58523.1465,  58534.17349
    ]
)

training_set.create([time], min_mag=10, max_mag=25, noise=None, n_class=500)

and here is the traceback:

Now simulating variables...
Variables successfully simulated
Now simulating constants...
Constants successfully simulated
Now simulating CV...
CVs successfully simulated
Now simulating microlensing...
Microlensing events successfully simulated
Writing files...
Saving features...
Computing principal components...

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-151bf43258ed> in <module>
     13 )
     14 
---> 15 training_set.create([time], min_mag=10, max_mag=25, noise=None, n_class=500)

~/anaconda3/lib/python3.7/site-packages/LIA/training_set.py in create(timestamps, min_mag, max_mag, noise, n_class, ml_n1, cv_n1, cv_n2, t0_dist, u0_dist, tE_dist)
    232     coeffs = np.loadtxt('all_features.txt',usecols=np.arange(2,49))
    233     pca = decomposition.PCA(n_components=47, whiten=True, svd_solver='auto')
--> 234     pca.fit(coeffs)
    235     #feat_strengths = pca.explained_variance_ratio_
    236     X_pca = pca.transform(coeffs)

~/anaconda3/lib/python3.7/site-packages/sklearn/decomposition/_pca.py in fit(self, X, y)
    342             Returns the instance itself.
    343         """
--> 344         self._fit(X)
    345         return self
    346 

~/anaconda3/lib/python3.7/site-packages/sklearn/decomposition/_pca.py in _fit(self, X)
    389 
    390         X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=True,
--> 391                         copy=self.copy)
    392 
    393         # Handle n_components==None

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    560         if force_all_finite:
    561             _assert_all_finite(array,
--> 562                                allow_nan=force_all_finite == 'allow-nan')
    563 
    564     if ensure_min_samples > 0:

~/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     58                     msg_err.format
     59                     (type_err,
---> 60                      msg_dtype if msg_dtype is not None else X.dtype)
     61             )
     62     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The error is linked to the PCA - but I thought it was switched off with this branch? Any ideas?

JulienPeloton commented 3 years ago

Also, this might indicate that all_features.txt gets wrongly computed, which might be a problem later on (even without the PCA components). Could it be related to the Gaussian noise routine that was recently changed?

Professor-G commented 3 years ago

Hi Julien, I checked the code in training_set.py and found that the PCA transformation occurs at line 233.

The features file is written at line 239, so we would just need to replace the X_pca variable with coeffs. Example:

np.savetxt('pcafeatures.txt', np.c_[classes, np.arange(1, n_class*4+1), coeffs[:,:47]], fmt='%s')

In short, the error you're getting comes from lines 233-236 of training_set.py: even though PCA has been turned off in the other files, creating a training set still produces two files by default, one with the raw statistics and one with the principal components. The PCA file doesn't have to be used, though, as microlensing_classifier.py has indeed been modified to train the classifier on the raw statistics alone. So in principle there is no need to modify training_set.py.
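For reference, a minimal sketch of reading the raw statistics file directly; the column layout is inferred from line 232 of training_set.py (class label, lightcurve ID, then 47 statistics), and the variable names here are just for illustration:

import numpy as np

# column 0: class label, column 1: lightcurve ID, columns 2-48: the 47 raw statistics
classes = np.loadtxt('all_features.txt', usecols=[0], dtype=str)
coeffs = np.loadtxt('all_features.txt', usecols=np.arange(2, 49))
print(classes.shape, coeffs.shape)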

As to why you are getting the error, I could not replicate it on my end. I downloaded this branch and successfully created a training set the same way you tried. Can you try again? If not, please email me the directory you are using. Cheers.

JulienPeloton commented 3 years ago

Hi Daniel -- thanks for the quick reply. My point is that to perform the PCA decomposition the code needs the features, and it is those features that cause the error.

In the FINK branch, the features are written at lines 221-228:

https://github.com/dgodinez77/LIA/blob/9eec601bd084885b693f24c42f2c26d5e8aeb454/LIA/training_set.py#L221-L228

then loaded back at line 232:

https://github.com/dgodinez77/LIA/blob/9eec601bd084885b693f24c42f2c26d5e8aeb454/LIA/training_set.py#L232

and finally the coeffs are used in the PCA at lines 233-234:

https://github.com/dgodinez77/LIA/blob/9eec601bd084885b693f24c42f2c26d5e8aeb454/LIA/training_set.py#L233-L234

In my case, the error happens at line 234, during the fit (the code complains that the coeffs contain NaNs or infs). So my point is that, PCA file aside, the features file itself seems wrong.

I am strictly using the FINK branch without any further modification (installed using pip install git+https://github.com/dgodinez77/LIA.git@FINK). To help reproduce the error, here are the versions of the packages I'm using:

macOS 11.2.1
Python 3.7.1

astropy: 4.0.2
numpy: 1.19.5
sklearn: 0.22
scipy: 1.4.1

JulienPeloton commented 3 years ago

Actually, here is the all_features.txt file obtained after running the code snippet in my first message (attached below), along with the code that gives the error:

from sklearn import decomposition
import numpy as np

# extracted from training_set.create line 232
coeffs = np.loadtxt('all_features.txt',usecols=np.arange(2,49))
pca = decomposition.PCA(n_components=47, whiten=True, svd_solver='auto')
pca.fit(coeffs)

all_features.txt

And looking at the data:

coeffs = np.loadtxt('all_features.txt',usecols=np.arange(2,49), dtype=np.float)

print(np.sum(np.isfinite(coeffs)))
# 93930
print(np.shape(coeffs)[0] * np.shape(coeffs)[1])
# 94000

mask = np.isfinite(coeffs)
print(coeffs[~mask])
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
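A quick follow-up check (hypothetical, not part of the original report) shows which rows and columns of the feature matrix carry the non-finite values, which helps pinpoint the offending statistic:

import numpy as np

coeffs = np.loadtxt('all_features.txt', usecols=np.arange(2, 49))

# indices of the lightcurves (rows) and statistics (columns) that are NaN/inf
bad_rows, bad_cols = np.where(~np.isfinite(coeffs))
print(np.unique(bad_cols))  # which statistics are affected
print(np.unique(bad_rows))  # which simulated lightcurves are affected
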
JulienPeloton commented 3 years ago

One way to remedy this would be to flag NaNs in extract_features.extract_all, as is done for inf currently:

https://github.com/dgodinez77/LIA/blob/9eec601bd084885b693f24c42f2c26d5e8aeb454/LIA/extract_features.py#L66

But it would be good to understand why those NaNs are suddenly showing up when they do not appear in the master branch.
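
For illustration, a minimal sketch of that suggestion, assuming a small helper applied to the feature vector before it is written out (the function name and the 0.0 fill value are assumptions, not the actual code in extract_features.extract_all):

import numpy as np

def flag_nonfinite(stats, fill_value=0.0):
    # Replace NaN/inf entries with a sentinel value, mirroring the existing
    # inf handling, so downstream code (PCA, classifier) never sees NaN.
    stats = np.asarray(stats, dtype=float)
    stats[~np.isfinite(stats)] = fill_value
    return stats

print(flag_nonfinite([1.0, np.nan, np.inf, 2.5]))  # [1.  0.  0.  2.5]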

ebachelet commented 3 years ago

Hi Julien

I will investigate this and come back to you.

ebachelet commented 3 years ago

Hi Julien

I just pushed a fix to the branch that should do the trick.

The reason for the NaNs was the auto_corr function, which returns NaN when the variance of the data is small. I replaced the function with one using numpy, and it should now work properly. Please let me know if there are any other issues.

The reason we did not see this in the master branch is that the noise model was not accurate, leading to high noise; therefore we never saw small variances in the lightcurves.
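For illustration, a minimal sketch of what such a numpy-based replacement could look like, with a guard for near-constant data (the function body and the 0.0 fallback are assumptions, not the exact code that was pushed):

import numpy as np

def auto_corr(mag):
    # Lag-1 autocorrelation of a magnitude series; returns 0.0 for
    # (near-)constant data instead of NaN, which is what polluted the
    # feature file for very low-noise simulated lightcurves.
    mag = np.asarray(mag, dtype=float)
    centered = mag - mag.mean()
    var = np.sum(centered**2)
    if var < 1e-12:  # guard against division by ~zero variance
        return 0.0
    return np.sum(centered[:-1] * centered[1:]) / var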

JulienPeloton commented 3 years ago

Thanks @ebachelet, the code no longer crashes with the fix. I will retrain the model and report on the performance change in a separate issue.

And thanks @dgodinez77 for the help!