102207429 / Expedia

Data Science and Big Data Analytics

Highly Correlated Attributes #11

Open 102207429 opened 7 years ago

102207429 commented 7 years ago

I used R to find the attributes that are highly correlated with price (our target). I think we should drop them, because they contribute nothing to our prediction. (P.S. I first took out the 17 attributes that are not numeric, and defined "highly correlated" as a correlation coefficient > 0.8.) That leaves 97 attributes. I'll use just these 97 attributes for the analysis if you think that's all right. Or should I adjust the definition of "highly correlated"?

GregoryWu commented 7 years ago

In Python, the correlation between Y and each x can be computed with: `FeatureCorrelate = train.corr()["price_doc"]`
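Here is a slightly fuller sketch of that one-liner on toy data. Only `price_doc` comes from our data set; the other column names and all values are made up for illustration. Recent pandas versions may raise an error when `corr()` meets non-numeric columns, so the sketch selects the numeric columns first and then applies the 0.8 threshold mentioned above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training set; only "price_doc" is a real column name.
rng = np.random.default_rng(0)
n = 200
area = rng.normal(50, 10, n)
train = pd.DataFrame({
    "area": area,                            # hypothetical numeric feature
    "noise": rng.normal(0, 1, n),            # hypothetical numeric feature
    "district": rng.choice(["a", "b"], n),   # hypothetical non-numeric feature
    "price_doc": 3.0 * area + rng.normal(0, 5, n),
})

# Keep numeric columns only, then correlate each feature with the target.
numeric = train.select_dtypes(include="number")
feature_corr = numeric.corr()["price_doc"].drop("price_doc")

# Features whose absolute correlation with the target exceeds the threshold.
threshold = 0.8
strong = feature_corr[feature_corr.abs() > threshold].index.tolist()

print(feature_corr.abs().sort_values(ascending=False))
print(strong)
```

Taking the absolute value matters: a feature with correlation -0.9 is as strongly related to the target as one with +0.9.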

May I ask whether the 17 attributes you took out are all weakly correlated with Y? Did you combine the features from the training data set and the macro data set? Combined, the total number of features is more than 300.

As for the highly correlated features, I think we can split them into several groups and then use PCA to compress each group. For instance, suppose x1, x2, x3 are highly correlated with one another, and so are x4, x5, x6. After applying PCA to each group, x1, x2, x3 become one feature and the other three become another.
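A minimal sketch of this grouping idea, on synthetic x1..x6. The helper names (`correlated_groups`, `first_component`, `compress`) are hypothetical, and treating groups as connected components of the "|correlation| > 0.8" graph is just one possible way to form them. scikit-learn's `PCA` class would be the usual tool; this sketch takes the first principal component from a plain NumPy SVD of the standardized columns instead, which should agree with it up to sign.

```python
import numpy as np
import pandas as pd

def correlated_groups(X, threshold=0.8):
    """Group columns that are linked by |correlation| > threshold."""
    corr = X.corr().abs()
    cols = list(X.columns)
    seen, groups = set(), []
    for start in cols:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], set()
        while stack:                      # depth-first walk over linked columns
            c = stack.pop()
            group.add(c)
            for other in cols:
                if other not in seen and corr.loc[c, other] > threshold:
                    seen.add(other)
                    stack.append(other)
        groups.append([c for c in cols if c in group])  # keep column order
    return groups

def first_component(block):
    """Scores on the first principal component of the standardized block."""
    z = (block - block.mean()) / block.std(ddof=0)
    u, s, _ = np.linalg.svd(z.to_numpy(), full_matrices=False)
    return u[:, 0] * s[0]

def compress(X, threshold=0.8):
    """Replace each correlated group by one component; keep lone columns."""
    out = {}
    for i, group in enumerate(correlated_groups(X, threshold)):
        if len(group) == 1:
            out[group[0]] = X[group[0]].to_numpy()
        else:
            out[f"pc_group{i}"] = first_component(X[group])
    return pd.DataFrame(out, index=X.index)

# Synthetic data: x1..x3 share one signal, x4..x6 share another.
rng = np.random.default_rng(0)
n = 300
a, b = rng.normal(size=n), rng.normal(size=n)
X = pd.DataFrame({
    "x1": a, "x2": a + 0.1 * rng.normal(size=n), "x3": a + 0.1 * rng.normal(size=n),
    "x4": b, "x5": b + 0.1 * rng.normal(size=n), "x6": b + 0.1 * rng.normal(size=n),
})

groups = correlated_groups(X)
compressed = compress(X)
print(groups)
print(compressed.shape)
```

One caveat with connected components: a chain of pairwise links can pull two only loosely related features into the same group, so it is worth inspecting the groups before compressing them.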

Maybe you can send me your R code, and I will try to port it to Python :)

102207429 commented 7 years ago

Oh, I only used the training data set. Should I use both of them combined?

Grouping them is a good idea. I've put my R code in the Code area on GitHub, so maybe you can take a look at it. I filled in the missing values first, then removed the redundant features (here we can group them instead of removing them). Finally, I ranked the features by importance. (R runs really slowly here; maybe Python will be faster.)
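For reference, the three steps might look roughly like this in pandas. This is a sketch under assumptions, not a port of the R code: median imputation, dropping the later column of each pair with |correlation| > 0.8, and ranking by absolute correlation with `price_doc` are all stand-ins chosen for illustration (a model-based importance measure may rank features differently). The helper `drop_redundant` and every column name except `price_doc` are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_redundant(X, threshold=0.8):
    """Drop the later column of every pair with |correlation| > threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop), to_drop

# Toy data with one redundant column and some missing values.
rng = np.random.default_rng(1)
n = 300
area = rng.normal(60, 15, n)
floor = rng.integers(1, 20, n).astype(float)
df = pd.DataFrame({
    "area": area,
    "area_copy": 1.1 * area + rng.normal(0, 0.5, n),  # nearly duplicates "area"
    "floor": floor,
    "price_doc": 2.0 * area + 0.5 * floor + rng.normal(0, 3, n),
})
df.loc[rng.choice(n, 20, replace=False), "floor"] = np.nan

# 1) Fill missing values with each column's median.
filled = df.fillna(df.median())

# 2) Remove redundant features (the target is excluded from the check).
X, dropped = drop_redundant(filled.drop(columns="price_doc"))

# 3) Rank the remaining features by |correlation| with the target.
ranking = X.corrwith(filled["price_doc"]).abs().sort_values(ascending=False)

print(dropped)
print(ranking)
```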

GregoryWu commented 7 years ago

Yes! It's better to combine both files, because economically, real estate prices are highly correlated with the macro data set.
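A minimal sketch of the combination step with pandas, assuming both files share a date column. The key name `timestamp`, the columns `area` and `usd_rate`, and all values are assumptions for illustration; only `price_doc` is taken from our data.

```python
import pandas as pd

# Toy training rows: several properties can share one date.
train = pd.DataFrame({
    "timestamp": ["2015-01-05", "2015-01-05", "2015-01-06"],  # assumed join key
    "area": [40.0, 55.0, 62.0],                                # hypothetical
    "price_doc": [5.1e6, 6.8e6, 7.4e6],
})

# Toy macro rows: one row per date.
macro = pd.DataFrame({
    "timestamp": ["2015-01-05", "2015-01-06", "2015-01-07"],
    "usd_rate": [58.3, 60.1, 61.0],                            # hypothetical
})

# Left join keeps every training row; validate guards against duplicate
# dates in the macro table silently multiplying training rows.
combined = train.merge(macro, on="timestamp", how="left", validate="many_to_one")
print(combined)
```

A left join is used so that no training row is lost; dates missing from the macro table would simply show up as NaN in the macro columns.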

How do you combine the redundant features with others?

I will check your code on Tuesday night. We can discuss the details on Wednesday.