Closed JulienPeloton closed 3 years ago
Also, this might indicate that all_features.txt
gets wrongly computed, which might be a problem later (even without the PCA components). Could it be related to the Gaussian noise routine that was recently changed?
Hi Julien, I checked the code in training_set.py and found that the PCA transformation occurs in line 233.
The features file is written in line 239, so we would just need to replace the X_pca variable with coeffs. Example:
np.savetxt('pcafeatures.txt', np.c_[classes, np.arange(1, n_class*4+1), coeffs[:, :47]], fmt='%s')
In short, the error you're getting is due to lines 233-236 in training_set.py. Even though PCA has been turned off in the other files, creating a training set by default creates two files: the file with the raw statistics, and the PCA file with the principal components. The PCA file doesn't have to be used, though, as the microlensing_classifier.py code has indeed been modified to train the classifier on the raw statistics alone. Therefore, in principle, there is no need to modify the training_set.py code.
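To illustrate the "raw statistics alone" idea, here is a minimal hedged sketch: the data are synthetic stand-ins for all_features.txt, the class labels are made up, and the RandomForestClassifier choice is an assumption for illustration, not necessarily what microlensing_classifier.py actually uses.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the raw statistics file:
# 100 lightcurves x 47 features, with hypothetical class labels.
rng = np.random.default_rng(1)
features = rng.normal(size=(100, 47))
classes = np.repeat(['ML', 'CONS', 'CV', 'VAR'], 25)

# Train directly on the raw statistics -- no PCA step involved.
clf = RandomForestClassifier(n_estimators=100, random_state=1)
clf.fit(features, classes)
print(clf.predict(features[:2]))
```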
As to why you are getting the error, I could not replicate it on my end. I downloaded this branch and successfully created a training set the same way you tried. Can you try again? If not, please email me the directory you are using. Cheers.
Hi Daniel -- thanks for the quick reply. My point is that to perform a PCA decomposition, the code needs the features, and it is those features that cause the error.
In the FINK branch, the features are written in lines 221-228,
then loaded back in line 232,
and finally the coeffs are used in the PCA in lines 233-234.
In my case, the error happens at line 234, during the fit (the code complains that the coeffs contain NaNs or Infs). So my point is that, the PCA file aside, the features file itself seems wrong.
I am strictly using the FINK branch without any further modification (installed using pip install git+https://github.com/dgodinez77/LIA.git@FINK). To help reproduce the error, here are the versions of the packages I'm using:
macOS 11.2.1
Python 3.7.1
astropy: 4.0.2
numpy: 1.19.5
sklearn: 0.22
scipy: 1.4.1
Actually, here is the all_features.txt file obtained after running the code snippet in my first message, and the code that gives the error:
from sklearn import decomposition
import numpy as np
# extracted from training_set.create line 232
coeffs = np.loadtxt('all_features.txt',usecols=np.arange(2,49))
pca = decomposition.PCA(n_components=47, whiten=True, svd_solver='auto')
pca.fit(coeffs)
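For reference, this failure mode can be reproduced without the features file at all: scikit-learn's PCA rejects any input containing NaN. The array below is synthetic, with a single NaN injected.

```python
import numpy as np
from sklearn import decomposition

# Synthetic stand-in for coeffs: 10 samples, 5 features, one NaN injected.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
X[3, 2] = np.nan

pca = decomposition.PCA(n_components=5, whiten=True, svd_solver='auto')
try:
    pca.fit(X)
except ValueError as e:
    print("PCA refused the input:", e)
```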
And looking at the data:
coeffs = np.loadtxt('all_features.txt', usecols=np.arange(2, 49), dtype=float)
print(np.sum(np.isfinite(coeffs)))
# 93930
print(np.shape(coeffs)[0] * np.shape(coeffs)[1])
# 94000
mask = np.isfinite(coeffs)
print(coeffs[~mask])
[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
One way to remedy this would be to flag nan
in extract_features.extract_all
as is currently done for inf.
But it would be good to understand why those nans
are suddenly showing up when they are not there in the master
branch.
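As a sketch of that remedy: non-finite values could be replaced with a sentinel before the features are written. The helper below is hypothetical (it is not part of LIA's extract_features module), and the sentinel value is an assumption.

```python
import numpy as np

def flag_non_finite(stats, sentinel=0.0):
    """Replace NaN and Inf in a feature vector with a sentinel value,
    mirroring the existing Inf handling in extract_features.extract_all.
    (Hypothetical helper, not part of LIA.)"""
    stats = np.asarray(stats, dtype=float)
    stats[~np.isfinite(stats)] = sentinel
    return stats

print(flag_non_finite([1.0, np.nan, np.inf, -np.inf, 2.5]))
# [1.  0.  0.  0.  2.5]
```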
Hi Julien
I will investigate this and come back to you.
Hi Julien
I just pushed a fix to the branch that should do the trick.
The reason for the NaNs was the auto_corr function, which returns NaN when the variance of the data is small. I replaced the function with one using numpy, and it should now work properly. Please let me know if there are other issues.
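For context on why small variance produces NaN: a naive lag-1 autocorrelation divides by the variance, so a (nearly) constant lightcurve gives 0/0. A numpy version can guard against that. This is a hedged sketch of the idea, not the actual code pushed to the branch:

```python
import numpy as np

def auto_corr_safe(mag):
    """Lag-1 autocorrelation; returns 0 instead of NaN when the
    variance is (numerically) zero. Sketch only, not LIA's exact fix."""
    mag = np.asarray(mag, dtype=float)
    centered = mag - mag.mean()
    var = np.dot(centered, centered)
    if var == 0.0:  # constant lightcurve: correlation undefined
        return 0.0
    return np.dot(centered[:-1], centered[1:]) / var

print(auto_corr_safe([1.0, 1.0, 1.0]))  # constant input: 0.0 instead of NaN
```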
The reason we did not see this in the master branch is that the noise model was not accurate, leading to high noise. Therefore we never saw small variances in the lightcurves.
Thanks @ebachelet, the code does not crash anymore with the fix. I will retrain the model and report on the performance change in a different issue.
And thanks @dgodinez77 for the help!
Hi - I recently switched to the FINK branch, and retrained models. However, when creating a training dataset, the code crashes with a ValueError. Here is a minimal example (that used to work) to reproduce the bug, and here is the traceback:
The error is linked to the PCA - but I thought it was switched off in this branch? Any ideas?