InseadDataAnalytics / INSEADAnalytics

Problem with cor() creating NA's in #149

Open GuillaumeBot opened 6 years ago

GuillaumeBot commented 6 years ago

Hi all, Hi @Anton262 & @VarunKShetty & @tevgeniou,

I wanted to apply MarketSegmentationProcessInClassParts1and2.Rmd to run unsupervised learning on our data set for the final project. However, I am having difficulty running the code, since my data are of mixed types (mainly integers and factors), unlike the boat data from assignment 3. The error log is the following (line 224):

3. stop("supply both 'x' and 'y' or a matrix-like 'x'")
2. cor(r, use = "pairwise")
1. principal(ProjectDataFactor, nfactors = max(factors_selected), rotate = rotation_used, score = TRUE)

The error occurs here:

Rotated_Factors<-round(Rotated_Results$loadings,2)
Rotated_Factors<-as.data.frame(unclass(Rotated_Factors))
colnames(Rotated_Factors)<-paste("Comp.",1:ncol(Rotated_Factors),sep="")

sorted_rows <- sort(Rotated_Factors[,1], decreasing = TRUE, index.return = TRUE)$ix
Rotated_Factors <- Rotated_Factors[sorted_rows,]

iprint.df(Rotated_Factors, scale=TRUE)
write.csv(Rotated_Factors, file = "Rotated_Factors.csv")

but I believe the cor() call is the root cause. So I tried replacing cor() with cor2(), which should handle the different types: https://www.rdocumentation.org/packages/ParallelPC/versions/1.2/topics/cor2

thecor <- round(cor2(ProjectDataFactor), 2) # cor2() is supposed to handle the different variable types
iprint.df(thecor, scale=TRUE)
write.csv(thecor, file = "thecor.csv")

Any idea?

tevgeniou commented 6 years ago

The problem is not with cor() but with principal(), which works only for numeric data. The options are either to generate new, meaningful numeric features and use those, or (less preferred) to try what is called "correspondence analysis" (https://en.wikipedia.org/wiki/Correspondence_analysis).
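
As a rough illustration of the first option, the sketch below expands factor columns into 0/1 indicator columns with base R's model.matrix(), so that every column is numeric before cor() is called. The data frame ProjectDataExample and its column names are made up for this example and are not from the course material. Whether indicator columns count as "meaningful" features depends on the data, so this is one possible approach under that assumption, not necessarily the recommended one.

```r
# Hypothetical mixed-type data (integer, numeric, factor); names are illustrative only
ProjectDataExample <- data.frame(
  age     = c(23L, 35L, 41L, 29L, 52L, 38L, 47L, 31L),
  income  = c(30, 52, 61, 45, 80, 58, 70, 40),
  segment = factor(c("a", "b", "c", "a", "b", "c", "a", "b"))
)

# Inspect the column types to see which ones are not numeric
print(sapply(ProjectDataExample, class))

# Expand each factor into 0/1 indicator columns. The first column of the
# model matrix is the intercept, so it is dropped. One level per factor is
# left out as the baseline, which avoids perfectly collinear columns.
numeric_features <- model.matrix(~ ., data = ProjectDataExample)[, -1, drop = FALSE]

# All columns are now numeric, so cor() returns a full matrix without NAs
thecor <- round(cor(numeric_features), 2)
print(thecor)
```

A numeric matrix such as numeric_features could then be passed to principal() from the psych package in place of the original mixed-type data frame. That call is left out here so that the sketch runs with base R alone.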