EHWUSF / HS68_2018_Project_1


PCA (Feature extraction/engineering) #3

Open choikwun opened 6 years ago

choikwun commented 6 years ago

High-dimensional data makes model training slow and is hard to visualize and conceptualize. Using PCA, we can reduce the number of dimensions and make the data more manageable. I propose a module which takes in a cleaned np array, determines which variables are correlated with each other, scales the correlated variables, and performs PCA on them. The output would be an np.array.
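A minimal sketch of what such a module could look like, using only NumPy. The function name, the correlation threshold, and the "keep uncorrelated columns as-is" behavior are my assumptions, not the author's final design:

```python
import numpy as np

def pca_correlated(X, corr_threshold=0.8, n_components=2):
    """Hypothetical sketch: reduce only the mutually correlated
    columns of a cleaned np.array X with PCA, pass the rest through."""
    corr = np.corrcoef(X, rowvar=False)
    # Flag a column as "correlated" if |r| >= threshold with any other column.
    mask = (np.abs(corr) >= corr_threshold) & ~np.eye(X.shape[1], dtype=bool)
    correlated = np.where(mask.any(axis=0))[0]
    rest = np.setdiff1d(np.arange(X.shape[1]), correlated)

    # Standardize the correlated block before PCA.
    Xc = X[:, correlated]
    Xc = (Xc - Xc.mean(axis=0)) / Xc.std(axis=0)

    # SVD-based PCA: rows of Vt are the principal axes.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T

    # Recombine the reduced components with the untouched columns.
    return np.hstack([scores, X[:, rest]])

# Example: two near-duplicate columns plus two independent ones.
rng = np.random.default_rng(0)
a = rng.normal(size=(100, 1))
X = np.hstack([a, a + 0.01 * rng.normal(size=(100, 1)),
               rng.normal(size=(100, 2))])
out = pca_correlated(X, n_components=1)  # shape (100, 3)
```

Here the two correlated columns collapse into one component while the two independent columns pass through unchanged.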

omidkj commented 6 years ago

What is the purpose of this step: "determines which variables are correlated to each other" before performing PCA?

rohitchadaram commented 6 years ago

Could you help me understand how PCA would help or contribute to a linear regression built on the PCA output?

nitieaj commented 6 years ago

While PCA reduces high-dimensional data, it also yields uncorrelated output components that can be used in regression. A model built on the PCA-transformed predictors should have better performance metrics than one built on the highly dimensional, collinear raw predictors.
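To illustrate the point above, here is a hedged, NumPy-only sketch of principal component regression: collinear predictors are collapsed to one component, which is then used in ordinary least squares. The data, seed, and single-component choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 5 highly collinear predictors driven by one latent factor.
latent = rng.normal(size=(200, 1))
X = latent + 0.05 * rng.normal(size=(200, 5))
y = 3.0 * latent[:, 0] + 0.1 * rng.normal(size=200)

# Standardize, then project onto the top principal component.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[:1].T  # (200, 1) component replaces 5 collinear columns

# Ordinary least squares on the component instead of the raw predictors.
A = np.hstack([np.ones((200, 1)), scores])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
```

With this synthetic setup one component captures nearly all of the variance, so the single-predictor regression fits well without the collinearity of the original five columns.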

choikwun commented 6 years ago

@omidkj picking only the variables that are correlated with each other would make the PCA easier to explain, since those are the variables whose variance PCA can actually consolidate.

choikwun commented 5 years ago

As @EHWUSF said today, I still have to decide whether to scale first and then assess correlation, or assess correlation first and then scale. I believe I should scale first and then assess correlation.
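One observation that may simplify this decision: Pearson correlation (what np.corrcoef computes) is invariant to per-column standardization, so for that correlation measure the ordering does not change the result — scaling only matters for the PCA step itself. A quick check of this claim (the data here is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(7)
# Columns on wildly different scales.
X = rng.normal(size=(50, 4)) * np.array([1.0, 10.0, 100.0, 0.01])

# Correlation assessed before scaling...
corr_raw = np.corrcoef(X, rowvar=False)

# ...and after standardizing each column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr_scaled = np.corrcoef(Z, rowvar=False)

same = np.allclose(corr_raw, corr_scaled)  # True
```

So scaling first is harmless for the correlation assessment, while still being essential before the PCA itself.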