Closed Biostat44 closed 3 years ago
Hi, good question.
For input variables, there are 3 solutions:
- the simplest way is to keep only the numerical IDs of the factors (usually called label or integer encoding)
- or you can binarize by creating n-1 columns of 0 or 1, one per class except a reference class (this is what is usually called dummy encoding)
- I often test the hashing trick
If I had only 2 packages to work with in R, I would always use automl & FeatureHashing.
Hope this helps.
Hello, First of all, thank you very much for your help. I examined the example for the "automl_predict" function in the package. The example contains the line "res <- cbind (ymat, round (automl_predict (model = amlmodel, X = xmat)))". When I ran the code, I saw that a prediction could belong to two different classes at the same time. In the picture I sent as an example, observation number 119 is predicted as 1 in both the 2nd class and the 3rd class. Is this normally possible? That is, can a single observation be assigned to two different classes at the same time? Or is there an alternative method that can be used instead of the "round" function here? Thanks again
Pleasure to help ! You should pass the output through a softmax function before round, it would just be perfect ;-)
Actually, I could not understand exactly what you mean :(. The code is given below. Can you tell me what I should change, and at which step?
data(iris)
xmat = iris[,1:4]
lab2pred <- levels(iris$Species)
lghlab <- length(lab2pred)
iris$Species <- as.numeric(iris$Species)
ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
ymat <- (ymat == as.numeric(iris$Species)) + 0
amlmodel <- automl_train_manual(Xref = xmat, Yref = ymat, hpar = list(modexec = 'trainwpso', verbose = FALSE))
res <- cbind(ymat, round(automl_predict(model = amlmodel, X = xmat)))
In fact, softmax would only amplify the bigger scores and make each row sum to 1, like probabilities. But maybe you only want the max, so replacing the last line with the following code would do the job:
yhat <- automl_predict(model = amlmodel, X = xmat); #to get the predictions
yhat <- t(apply(yhat, 1, function(x){i=which.max(x);x[1:length(x)]=0;x[i]=1;return(x)})); #to get the max for each row
res <- cbind(ymat, yhat); #to bind actual with prediction
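To make the difference between the three post-processing options concrete, here is a minimal base R sketch. The score matrix is made up for illustration and is not real automl output, and `softmax_rows` is a hypothetical helper, not a function from the automl package.

```r
# Made-up raw scores for 3 observations x 3 classes (not real automl output)
scores <- matrix(c(0.10, 0.55, 0.60,
                   0.90, 0.20, 0.05,
                   0.30, 0.35, 0.40),
                 nrow = 3, byrow = TRUE)

# round() thresholds each cell on its own, so a row can get 0, 1 or several 1s
rounded <- round(scores)

# Softmax: rescales each row to positive values that sum to 1
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))  # subtract the row max for numerical stability
  e / rowSums(e)
}
probs <- softmax_rows(scores)

# Arg-max: exactly one 1 per row, at the position of the largest score
onehot <- t(apply(scores, 1, function(x) as.numeric(seq_along(x) == which.max(x))))

print(rowSums(rounded))  # number of classes assigned per row by round()
print(rowSums(probs))    # each row sums to 1 after softmax
print(rowSums(onehot))   # exactly one class per row with arg-max
```

With these toy scores, `round()` assigns two classes to the first row and none to the third. Rounding after softmax can never assign two classes to a three-class row, but it can still leave a row without any class when no probability reaches 0.5, which is why taking the row maximum is the safer choice when exactly one class per observation is wanted.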
Thank you very much for your reply. I have another question: is it appropriate to use this package if at least one of the predictors in the data set is qualitative? Thank you so much.
Of course you can! All neural nets (from classic perceptrons to CNNs or RNNs) have to transform text data into numeric (digital) form before learning and scoring.
Thank you so much for your interest.
Hello again. Following your previous answer, I tried a data set with qualitative variables and got the error shown in the attached picture. I do not encounter this error with a data set that does not contain qualitative data. So how can I include data sets containing qualitative / categorical / factor data in the analysis? Is digitalization meant to transform qualitative variables into quantitative variables?
...Is digitalization meant to transform qualitative variables into quantitative variables?...
: exactly !
Hello again, I could not fully understand. Let my output variable be a categorical / qualitative variable and my input variables be age (quantitative/numerical) and gender (qualitative/categorical). How can I model data that includes both categorical and numerical input variables? Does the categorical input (i.e., gender) need any transformation? Thanks so much again, good luck.
By reading my first answer this way ! ;-)
For the input variable age (quantitative/numerical):
-> nothing special to do
For gender (qualitative/categorical):
-> for input variables, 3 solutions:
- the simplest way is to keep only the numerical IDs of the factors (usually called label or integer encoding)
- or you can binarize by creating n-1 columns of 0 or 1, one per class except a reference class (this is what is usually called dummy encoding)
- I often test the hashing trick
If I had only 2 packages to work with in R, I would always use automl & FeatureHashing.
For "Let my output variable be a categorical / qualitative variable":
-> concerning output variables (targets), personally I use binarization (with softmax or not), but keeping the numerical IDs of the classes can do the trick
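As a rough illustration of the encodings listed above, here is a base R sketch on a hypothetical toy data frame (the column names `age`, `gender` and `group` and all values are invented). The hashing trick is left out because it needs the FeatureHashing package; only base R calls are used here.

```r
# Hypothetical toy data: one numeric input, one categorical input, categorical target
toy <- data.frame(
  age    = c(23, 45, 31, 52),
  gender = factor(c("F", "M", "M", "F")),
  group  = factor(c("a", "c", "b", "a"))
)

# Inputs, option 1: keep the integer codes of the factor levels
x_codes <- data.frame(age = toy$age, gender = as.integer(toy$gender))

# Inputs, option 2: n-1 binary columns per factor (default treatment contrasts);
# [, -1] drops the intercept column that model.matrix() adds
x_dummy <- model.matrix(~ age + gender, data = toy)[, -1, drop = FALSE]

# Target: one 0/1 column per class (n columns, no reference level dropped)
y_onehot <- model.matrix(~ group - 1, data = toy)
colnames(y_onehot) <- levels(toy$group)

print(x_codes)
print(x_dummy)
print(y_onehot)
```

The numeric column `age` passes through unchanged in both input encodings. The target keeps one column per class, which matches the 0/1 `ymat` layout used in the iris example earlier in the thread.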
Good evening again. Following your suggestions, I wrote the code below. After the predictions, I tried to build the confusion matrix, but the result was not what I hoped for. Can you help me with this? Thank you very much in advance for your interest. I am also attaching the sample data set that I use: automl_featurehashing.xlsx
dataset$Age=as.numeric(dataset$Age)
str(dataset)
outcomeName <- 'Groups'
Group=dataset[,outcomeName]
lab2pred <- levels(Group)
dataset2_hash=dataset[,-1]
predictorNames <- setdiff(names(dataset2_hash),'Groups')
set.seed(123)
split <- sample(nrow(dataset2_hash), floor(0.8*nrow(dataset2_hash)))
objTrain <-dataset2_hash[split,]
gruptrain=Group[split]
objTest <- dataset2_hash[-split,]
gruptest=Group[-split]
library(FeatureHashing)
objTrain_hashed = hashed.model.matrix(~., data=objTrain[,predictorNames], hash.size=8, transpose=FALSE)
objTrain_hashed = as(objTrain_hashed, "dgCMatrix")
objTraingrup_hashed = hashed.model.matrix(~., data=as.data.frame(gruptrain), hash.size=8, transpose=FALSE)
objTraingrup_hashed = as(objTraingrup_hashed, "dgCMatrix")
objTest_hashed = hashed.model.matrix(~., data=objTest[,predictorNames], hash.size=8, transpose=FALSE)
objTest_hashed = as(objTest_hashed, "dgCMatrix")
objTestgrup_hashed = hashed.model.matrix(~., data=as.data.frame(gruptest), hash.size=8, transpose=FALSE)
objTestgrup_hashed = as(objTestgrup_hashed, "dgCMatrix")
amlmodel <- automl_train(Xref = objTrain_hashed, Yref = objTraingrup_hashed, autopar = list(numiterations = 1, psopartpopsize = 1, seed = 11), hpar = list(modexec = 'trainwpso', verbose = FALSE))
yhat <- automl_predict(model = amlmodel, X = as.matrix(objTest_hashed))
yhat <- t(apply(yhat, 1, function(x){i=which.max(x);x[1:length(x)]=0;x[i]=1;return(x)}))
res <- cbind(objTestgrup_hashed, yhat); #to bind actual with prediction
colnames(res) <- c(paste('act',lab2pred, sep = '_'), paste('pred',lab2pred, sep = '_'))
real=as.factor(sapply(1:6, function(i) colnames(res[,1:3])[which(res[,1:3][i,]==1)]))
real=as.factor(sapply(1:6, function(i) substring(real[i], 5)))
pred=as.factor(sapply(1:6, function(i) colnames(res[,4:6])[which(res[,4:6][i,]==1)]))
pred=as.factor(sapply(1:6, function(i) substring(pred[i], 6)))
confusionMatrix(pred,real)
I would really appreciate your help and a reply. Thanks again
Back! Sorry, I have a job on the side ;-) I'll have a look today.
I only see 30 observations! You won't do much with so little data (even with data augmentation). I don't really understand what you want to do, but supposing that you want to model/predict the column Groups from the others (needless to say, with more observations!), I would simply do it this way:
dataset$Gender <- as.numeric(dataset$Gender)
dataset$Severity <- as.numeric(dataset$Severity)
xmat = dataset[,2:4]
lghlab <- max(as.numeric(dataset$Groups)); #as.numeric so that this also works if Groups is a factor
ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
ymat <- (ymat == as.numeric(dataset$Groups)) + 0
amlmodel <- automl_train(Xref = xmat, Yref = ymat); #to customize according to documentation
yhat <- automl_predict(model = amlmodel, X = xmat); #to get the predictions
yhat <- t(apply(yhat, 1, function(x){i=which.max(x);x[1:length(x)]=0;x[i]=1;return(x)})); #to get the max for each row
res <- cbind(ymat, yhat); #to bind actual with prediction
I'm really sorry for annoying you and for providing incomplete information :(. This is not the data I actually want to analyze. I used it only to make the concept of "feature hashing" easier to set up and understand (hence the small number of observations). I am trying to understand how to combine automl and feature hashing using the code I sent. My goal is to predict the "Groups" variable in the data set. I am very, very grateful for your attention.
OK, so just replace these lines of code:
dataset$Gender <- as.numeric(dataset$Gender)
dataset$Severity <- as.numeric(dataset$Severity)
xmat = dataset[,2:4]
by
MyHashSize <- 2; # make your choice
xmat <- FeatureHashing::hashed.model.matrix(~., data = dataset[,2:4], hash.size = MyHashSize, transpose = FALSE, create.mapping = FALSE, is.dgCMatrix = TRUE, signed.hash = FALSE, progress = FALSE)
Thank you. With your permission, I would like to ask two last questions. First, should the minimum value of "hash.size" be the total number of categories across the qualitative variables? Second, since the output variable is categorical, is it necessary to apply "FeatureHashing" to it as well? If so, would it be correct to encode the "Groups" variable as follows?
MyHashSize = 3
ymat <- FeatureHashing::hashed.model.matrix(~., data = dataset[1], hash.size = MyHashSize, transpose = FALSE, create.mapping = FALSE, is.dgCMatrix = TRUE, signed.hash = FALSE, progress = FALSE)
You should consider FeatureHashing for dimensionality reduction and quick data preparation (in one go, ideal for live scoring), but only for inputs. Maybe I'm missing something huge, but for the output ...
Hello. When I use the automl package on a data set with qualitative / categorical variables, the reference (output) variable has to be numeric. When I encode the reference variable numerically, the code works, but the results are also returned as numbers, so they have to be converted back into class labels. Does the automl package support an output variable that is a factor? If the automl package does not work with qualitative variables directly, how can we model data sets containing qualitative variables using this package? If you could help with this, I would greatly appreciate it. Thank you, and good work.
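Whether automl accepts a factor target directly is not settled in this thread. One possible workaround, sketched here in base R under stated assumptions, is to keep the factor levels aside and map the numeric predictions back to labels afterwards. The score matrix `yhat` below is made up and only stands in for the matrix returned by a prediction call; the labels and true classes are invented as well.

```r
# Hypothetical class labels and true classes
lab2pred <- c("setosa", "versicolor", "virginica")
actual   <- factor(c("setosa", "virginica", "versicolor", "virginica"),
                   levels = lab2pred)

# Made-up prediction scores, one row per observation and one column per class
# (stands in for the matrix returned by a prediction call)
yhat <- matrix(c(0.8, 0.1, 0.1,
                 0.2, 0.3, 0.5,
                 0.1, 0.3, 0.6,
                 0.0, 0.2, 0.9),
               nrow = 4, byrow = TRUE)

# Column index of the largest score in each row, mapped back to the labels
pred <- factor(lab2pred[apply(yhat, 1, which.max)], levels = lab2pred)

# Base R cross-tabulation of predicted vs. actual classes
conf <- table(predicted = pred, actual = actual)
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(accuracy)
```

Giving both factors the same `levels` keeps the table square even when a class is never predicted, which is also what confusion-matrix helpers in other packages generally expect.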