MaxHalford / prince

:crown: Multivariate exploratory data analysis in Python — PCA, CA, MCA, MFA, FAMD, GPA
https://maxhalford.github.io/prince
MIT License
1.25k stars 182 forks

Q: what is a good option for FA with Boolean Data? #133

BradKML closed this issue 1 year ago

BradKML commented 1 year ago

Problem: I have a binary dataset (answers to Yes/No questions) that could benefit from dimensionality reduction, with the result then applied to feature selection and regression. The data is in ves_data.csv.zip.

Currently there are options for doing this:

Some code to get started with the data:

```python
from pandas import read_csv
from sklearn.utils import shuffle

table = read_csv('ves_data.csv')

# MM01* columns that are left out
excluded = [
    'MM01001', 'MM01BR', 'MM01003A', 'MM01003B',
    'MM01003C', 'MM010567', 'MM010568', 'MM010569', 'MM010570', 'MM010571',
    'MM010572', 'MM010573', 'MM010574', 'MM010575', 'MM010576', 'MM010577',
    'MM010578', 'MM010579', 'MM010580', 'MM010581',
]
scores = ["PA", "GIT", "AFQT", "WAIS_BD", "WAIS_GI", 'VERAW', 'ARRAW', 'VESS', 'ARSS']

total = table[[i for i in table if ('MM01' in i and i not in excluded) or i in scores]]
total = total[[i for i in table if ('MM01' in i and i not in excluded) or i in ['AFQT']]].dropna()  # 'GIT' is good too

X, y = shuffle(total.drop(['AFQT'], axis=1), total['AFQT'], random_state=13)
X = X - 1  # calibrating the range from 1~2 to 0~1
X = X.to_numpy()  # needed for some code to function
```

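One candidate for 0/1 data like `X` is multiple correspondence analysis (MCA), one of the methods prince covers. The following is only a minimal sketch of the idea, assuming MCA computed as a correspondence analysis of the one-hot indicator matrix. It uses plain NumPy rather than prince's API, and the helper name `mca_row_coordinates` and the toy matrix `X_toy` are hypothetical, not taken from the dataset.

```python
import numpy as np


def mca_row_coordinates(X, n_components=2):
    """Hypothetical helper: MCA of a 0/1 matrix via CA of its indicator matrix."""
    X = np.asarray(X, dtype=float)
    # Indicator (one-hot) matrix: one column for "1" and one for "0" per question.
    Z = np.hstack([X, 1.0 - X])
    # Drop categories that never occur, since their column mass would be zero.
    Z = Z[:, Z.sum(axis=0) > 0]
    P = Z / Z.sum()        # correspondence matrix
    r = P.sum(axis=1)      # row masses
    c = P.sum(axis=0)      # column masses
    # Standardized residuals from the independence model r c^T.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates: D_r^{-1/2} U Sigma.
    F = (U * sigma) / np.sqrt(r)[:, None]
    # Squared singular values are the principal inertias of each axis.
    return F[:, :n_components], sigma[:n_components] ** 2


# Toy 0/1 answers (rows = respondents, columns = questions), made up for illustration.
X_toy = np.array([
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
])
coords, inertia = mca_row_coordinates(X_toy, n_components=2)
print(coords.shape)
```

With the `X` array from the preparation code, the returned coordinates could serve as reduced features for the AFQT regression. Whether MCA suits this data better than other approaches is left open here.
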
MaxHalford commented 1 year ago

I'm closing this because I'm not sure what more there is to say. There might indeed be better methods for handling boolean data. Feel free to contribute one if you can show it's relevant.