erdogant / pca

pca: A Python Package for Principal Component Analysis.
https://erdogant.github.io/pca
MIT License

Defining Necessary Number of Dimensions #32

Open BradKML opened 1 year ago

BradKML commented 1 year ago

For fun, I also borrowed some other data from This Link to see how personality and test performance can be condensed into a dimensionally reduced model: personality_score.csv

Question 1: What is the proper way of selecting a sufficient number of dimensions to preserve the data while avoiding noise? Kaiser–Meyer–Olkin, Levene, and others all seem to be better descriptors than the "eigenvalue > 1" rule.
Question 2: Can PCA be integrated with something else so that it behaves like PCR or Lasso regression (i.e. reducing the number of unnecessary columns before attempting to be accurate)?
Question 3: Can ICA be used to discover significant columns? It is seen as a way to isolate components after using PCA to assess the proper dimension count.
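
Regarding Question 1, one widely used alternative to the eigenvalue > 1 rule is to keep however many components are needed to reach a cumulative explained-variance target. A minimal sketch with scikit-learn's PCA (the 0.95 threshold is an arbitrary choice, and StandardScaler is only one possible scaler):

from pandas import read_csv
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = read_csv('https://files.catbox.moe/4nztka.csv')
X = df.drop(columns=df.columns[0]).drop(columns=['AFQT'])

# a float n_components keeps the smallest number of components whose
# cumulative explained-variance ratio exceeds that fraction
pca_95 = PCA(n_components=0.95)
pca_95.fit(StandardScaler().fit_transform(X))
print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.cumsum())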

!pip install pca
from pandas import read_csv
from pca import pca

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])  # drop the unnamed index column
y = df[['AFQT']]                     # test-performance score
X = df.drop(columns=['AFQT'])        # personality items
model = pca(normalize=True)
results = model.fit_transform(X)
print(model.results['explained_var'])
fig, ax = model.plot()               # explained-variance plot
fig.savefig('personality_performance.png')

personality_performance.png
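
For Question 2, PCA can be chained in front of a regression estimator so that the regression only sees the reduced components (principal component regression); using an L1-penalised estimator on top adds Lasso-style shrinkage of the remaining coefficients. A rough sketch with scikit-learn's Pipeline (n_components=6 and alpha=0.1 are placeholder values, not tuned):

from pandas import read_csv
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])
X, y = df.drop(columns=['AFQT']), df['AFQT']

# scale -> reduce -> regress; the regressor never sees the original columns
pcr_lasso = make_pipeline(RobustScaler(), PCA(n_components=6), Lasso(alpha=0.1))
print(cross_val_score(pcr_lasso, X, y, cv=5).mean())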

BradKML commented 1 year ago

Current finding: six components are enough to describe the data with eigenvalue > 1 when RobustScaler is used: [9.136, 2.683, 2.078, 1.420, 1.328, 1.090], similar to the count without any scaling. MaxAbsScaler yields only one weak component, whereas StandardScaler yields 159 components, the first six being [52.051, 12.815, 9.561, 8.205, 6.741, 5.902]. It seems that normalization does not help with clearing noise in some cases.
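
That comparison can be reproduced by looping over the scalers and counting the components whose eigenvalue exceeds one (a sketch using the same CSV and scikit-learn preprocessing as in the snippet below):

from pandas import read_csv
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler
from sklearn.decomposition import PCA

df = read_csv('https://files.catbox.moe/4nztka.csv')
X = df.drop(columns=df.columns[0]).drop(columns=['AFQT'])

# count components with eigenvalue > 1 for each scaling choice
for scaler in (None, MaxAbsScaler(), RobustScaler(), StandardScaler()):
    X_s = X if scaler is None else scaler.fit_transform(X)
    n_kept = (PCA().fit(X_s).explained_variance_ > 1).sum()
    print('no scaling' if scaler is None else type(scaler).__name__, n_kept)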

Question 4: How can one check the significance of an ICA component? Question 5: If one were to use the 159 components, what is the strategy for determining the most useful columns in each component?

from pandas import read_csv
from sklearn.preprocessing import MaxAbsScaler, RobustScaler, StandardScaler

df = read_csv('https://files.catbox.moe/4nztka.csv')
df = df.drop(columns=df.columns[0])  # drop the unnamed index column
X, y = df.drop(columns=['AFQT']), df[['AFQT']]
# X_transformed = MaxAbsScaler().fit_transform(X)   # in case of Yes/No questions
X_transformed = RobustScaler().fit_transform(X)     # in case of Likert scales
# X_transformed = StandardScaler().fit_transform(X) # in case of aggregates

from sklearn.decomposition import PCA, FastICA
from pandas import DataFrame

# Kaiser rule: keep the components whose eigenvalue (explained variance) exceeds 1
pca = PCA()
X_transformed_pca = pca.fit_transform(X_transformed)
suff_len = len([i for i in pca.explained_variance_ if i > 1])
print(pca.explained_variance_[:suff_len])

# extract that many independent components and map the unmixing weights back to columns
ica = FastICA(n_components=suff_len, random_state=0)  # fixed seed for reproducibility
X_transformed_ica = ica.fit_transform(X_transformed)
df_comp = DataFrame(ica.components_, columns=X.columns)
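
Building on the snippet above, one rough way to approach Questions 4 and 5 is to rank the original columns by the absolute loadings of each retained component, and to use the excess kurtosis (non-Gaussianity) of the recovered ICA sources as a crude indicator of how far each independent component departs from noise. A sketch reusing pca, suff_len, df_comp and X_transformed_ica from above (the top-5 cutoff is arbitrary, and kurtosis is a heuristic, not a significance test):

from pandas import DataFrame
from scipy.stats import kurtosis

# top-5 original columns per principal component, ranked by |loading|
loadings = DataFrame(pca.components_[:suff_len], columns=X.columns)
for i, row in loadings.iterrows():
    print(f'PC{i + 1}:', row.abs().nlargest(5).index.tolist())

# top-5 original columns per independent component, ranked by |unmixing weight|
for i, row in df_comp.iterrows():
    print(f'IC{i + 1}:', row.abs().nlargest(5).index.tolist())

# excess kurtosis of each recovered source; values near 0 look Gaussian (noise-like)
print(kurtosis(X_transformed_ica, axis=0))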