automl Problem with qualitative data?

Biostat44 commented 3 years ago

Hello. When I use the automl package on a data set with qualitative / categorical variables, the reference (target) variable has to be numeric. When I encode the reference variable numerically, the code runs, but the results are also returned as numbers, so I have to decode them again. Does the automl package support a factor as the output variable? If it does not work with qualitative variables, how can we model data sets containing qualitative variables with this package? I would greatly appreciate any help with this. Thank you, and good work.

aboulaboul commented 3 years ago

Hi, good question.

For input variables, there are 3 solutions:
- the simplest way is to keep only the numerical IDs of the factor levels (I call it dummy encoding; it is more commonly known as integer or label encoding)
- or you can binarize by creating n-1 columns, one per class, holding 0 or 1
- I often test the hashing trick

If I had only 2 packages to work with in R, I would always use automl & FeatureHashing.

Concerning output variables (the target): personally I use binarization (with softmax or not), but dummy encoding can do the trick.
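A minimal sketch of the first two input encodings in base R, assuming a toy factor `gender` (hypothetical data, not from the thread). It only illustrates the shapes each encoding produces; the hashing trick needs the FeatureHashing package and is not shown here.

```r
# Toy data (hypothetical): one categorical input with three levels.
gender <- factor(c("female", "male", "other", "male", "female"))

# 1) Numerical IDs: a single numeric column holding the level index.
gender_id <- as.integer(gender)

# 2) Binarization with n-1 columns: model.matrix() treats the first level
#    as the reference class; [, -1] drops the intercept column.
gender_bin <- model.matrix(~ gender)[, -1, drop = FALSE]

print(gender_id)
print(gender_bin)
```

The integer IDs impose an arbitrary order on the classes, which is why the n-1 indicator columns are often preferred when the categories have no natural ranking.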

Hope this helps.

Biostat44 commented 3 years ago

Hello, first of all, thank you very much for your help. I examined the example for the "automl_predict" function in the package. The example contains the line "res <- cbind(ymat, round(automl_predict(model = amlmodel, X = xmat)))". When I ran the code, I saw that a prediction can belong to two different classes at the same time. In the picture I sent as an example, observation number 119 is predicted as 1 in both the 2nd and the 3rd class. Is this normally possible? That is, can an observation belong to two different classes at the same time? Or is there an alternative method that can be used instead of the "round" function here? Thanks again. (attached image: image_automl)

aboulaboul commented 3 years ago

A pleasure to help! You should pass the output through a softmax function before rounding; that would be just perfect ;-)
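A minimal sketch of a row-wise softmax in base R, assuming hypothetical raw scores (the helper name `softmax_rows` is made up for illustration and is not part of automl). It shows how rounding the raw scores can mark two classes at once, and that rounding after softmax can instead leave a row with no class when no probability exceeds 0.5.

```r
# Row-wise softmax: every row becomes positive and sums to 1.
softmax_rows <- function(m) {
  e <- exp(m - apply(m, 1, max))  # subtract each row's max for numerical stability
  e / rowSums(e)
}

# Hypothetical raw scores: two observations, three classes.
scores <- matrix(c(0.2, 0.9, 0.8,
                   0.1, 0.3, 2.5), nrow = 2, byrow = TRUE)

print(round(scores))       # first row is flagged in two classes at once
probs <- softmax_rows(scores)
print(round(probs, 3))     # rows now sum to 1
print(round(probs))        # first row has no probability above 0.5
print(apply(probs, 1, which.max))  # taking the max always picks exactly one class
```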

Biostat44 commented 3 years ago

Actually, I could not understand exactly what you mean :(. The code is given below. At which step should I do this? Can you help?

    library(automl)
    data(iris)
    xmat = iris[,1:4]
    lab2pred <- levels(iris$Species)
    lghlab <- length(lab2pred)
    iris$Species <- as.numeric(iris$Species)
    ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
    ymat <- (ymat == as.numeric(iris$Species)) + 0
    amlmodel <- automl_train_manual(Xref = xmat, Yref = ymat, hpar = list(modexec = 'trainwpso', verbose = FALSE))
    res <- cbind(ymat, round(automl_predict(model = amlmodel, X = xmat)))

aboulaboul commented 3 years ago

In fact, softmax would only amplify the bigger scores and make them sum to 1, like probabilities. But maybe you only want the max, so replacing the last line with the following code would do the job:

    yhat <- automl_predict(model = amlmodel, X = xmat) # get the predictions
    # keep only the max of each row: 1 for the highest score, 0 elsewhere
    yhat <- t(apply(yhat, 1, function(x) {i <- which.max(x); x[] <- 0; x[i] <- 1; x}))
    res <- cbind(ymat, yhat) # bind actual with predicted
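Since the original question was about getting class names back instead of numbers, here is a minimal sketch of the decoding step in base R. It assumes a hypothetical score matrix `yhat` with one column per class, in the same order as `lab2pred`; the values are invented for illustration.

```r
# Class labels, in the same order as the columns of the score matrix.
lab2pred <- c("setosa", "versicolor", "virginica")

# Hypothetical prediction scores: three observations, three classes.
yhat <- matrix(c(0.9, 0.2, 0.1,
                 0.1, 0.6, 0.7,
                 0.3, 0.8, 0.4), nrow = 3, byrow = TRUE)

# Column index of the highest score in each row (first one wins on ties).
pred_id <- max.col(yhat, ties.method = "first")

# Map the indices back to the original factor labels.
pred_label <- factor(lab2pred[pred_id], levels = lab2pred)
print(pred_label)
```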

Biostat44 commented 3 years ago

Thank you very much for your reply. I have another question: is it appropriate to use this package if at least one of the predictors in the data set is qualitative? Thank you so much.

aboulaboul commented 3 years ago

Of course you can! All neural nets (from classic perceptrons to CNNs or RNNs) need text data to be transformed into digital (numeric) form before learning and scoring.

Biostat44 commented 3 years ago

Thank you so much for your interest.

Biostat44 commented 3 years ago

Hello again. Following your previous answer, I ran a data set with qualitative variables, but I get the error shown in the attached picture. I do not get this error with a data set that contains no qualitative data. So how can I include data sets containing qualitative / categorical / factor data in the analysis? Is digitalization meant to transform qualitative variables into quantitative variables? (attached image: auto)

aboulaboul commented 3 years ago

"Is digitalization meant to transform qualitative variables into quantitative variables?": exactly!

Biostat44 commented 3 years ago

Hello again, I could not fully understand. Let my output variable be a categorical / qualitative variable and input variables be age (quantitative/numerical) and gender (qualitative/categorical). How can I model data that includes both categorical and numerical input variables? Does the categorical input (i.e., gender) need any transformation? Thanks so much again, good luck.

aboulaboul commented 3 years ago

By reading my first answer this way! ;-)

For "input variables be age (quantitative/numerical)" -> nothing special to do.

For "gender (qualitative/categorical)" -> for input variables, there are 3 solutions:
- the simplest way is to keep only the numerical IDs of the factor levels (I call it dummy encoding; it is more commonly known as integer or label encoding)
- or you can binarize by creating n-1 columns, one per class, holding 0 or 1
- I often test the hashing trick

If I had only 2 packages to work with in R, I would always use automl & FeatureHashing.

For "Let my output variable be a categorical / qualitative variable" -> concerning output variables (the target): personally I use binarization (with softmax or not), but dummy encoding can do the trick.
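A minimal sketch of the age / gender case in base R, assuming a small hypothetical data frame (the column names and values are invented). It builds a numeric input matrix and a binarized target of the same shape as the `xmat` / `ymat` pair used with `automl_train_manual` earlier in the thread; the training call itself is left out.

```r
# Hypothetical data: one numeric input, one categorical input, one categorical target.
df <- data.frame(
  age    = c(23, 35, 41, 52, 29, 60),
  gender = factor(c("f", "m", "f", "m", "m", "f")),
  group  = factor(c("A", "B", "C", "A", "C", "B"))
)

# Inputs: age stays as it is, gender becomes n-1 indicator columns.
# [, -1] drops the intercept column added by model.matrix().
xmat <- model.matrix(~ age + gender, data = df)[, -1, drop = FALSE]

# Target: one 0/1 column per class ("- 1" removes the intercept so that
# no class is dropped as a reference level).
ymat <- model.matrix(~ group - 1, data = df)
colnames(ymat) <- levels(df$group)

print(xmat)
print(ymat)
```

Each row of `ymat` contains exactly one 1, so the predicted class can later be read off as the column with the highest score.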

Biostat44 commented 3 years ago

Good evening again. Following your suggestions, I wrote the code below. After the predictions, I tried to build the confusion matrix, but the result was not what I hoped for. Can you help me with this? Thank you very much in advance for your interest. I am also attaching the sample dataset that I use. (attachment: automl_featurehashing.xlsx)

    dataset$Age=as.numeric(dataset$Age)
    str(dataset)

    outcomeName <- 'Groups'
    Group=dataset[,outcomeName]
    lab2pred <- levels(Group)
    dataset2_hash=dataset[,-1]

    predictorNames <- setdiff(names(dataset2_hash),'Groups')

    set.seed(123)
    split <- sample(nrow(dataset2_hash), floor(0.8*nrow(dataset2_hash)))
    objTrain <- dataset2_hash[split,]
    gruptrain=Group[split]
    objTest <- dataset2_hash[-split,]
    gruptest=Group[-split]

    library(FeatureHashing)
    objTrain_hashed = hashed.model.matrix(~., data=objTrain[,predictorNames], hash.size=8, transpose=FALSE)
    objTrain_hashed = as(objTrain_hashed, "dgCMatrix")
    objTraingrup_hashed = hashed.model.matrix(~., data=as.data.frame(gruptrain), hash.size=8, transpose=FALSE)
    objTraingrup_hashed = as(objTraingrup_hashed, "dgCMatrix")
    objTest_hashed = hashed.model.matrix(~., data=objTest[,predictorNames], hash.size=8, transpose=FALSE)
    objTest_hashed = as(objTest_hashed, "dgCMatrix")
    objTestgrup_hashed = hashed.model.matrix(~., data=as.data.frame(gruptest), hash.size=8, transpose=FALSE)
    objTestgrup_hashed = as(objTestgrup_hashed, "dgCMatrix")

    amlmodel <- automl_train(Xref = objTrain_hashed, Yref = objTraingrup_hashed, autopar = list(numiterations = 1, psopartpopsize = 1, seed = 11), hpar = list(modexec = 'trainwpso', verbose = FALSE))

    yhat <- automl_predict(model = amlmodel, X = as.matrix(objTest_hashed))
    yhat <- t(apply(yhat, 1, function(x){i=which.max(x);x[1:length(x)]=0;x[i]=1;return(x)}))

    res <- cbind(objTestgrup_hashed, yhat) #to bind actual with prediction
    colnames(res) <- c(paste('act',lab2pred, sep = '_'), paste('pred',lab2pred, sep = '_'))

    real=as.factor(sapply(1:6, function(i) colnames(res[,1:3])[which(res[,1:3][i,]==1)]))
    real=as.factor(sapply(1:6, function(i) substring(real[i], 5)))
    pred=as.factor(sapply(1:6, function(i) colnames(res[,4:6])[which(res[,4:6][i,]==1)]))
    pred=as.factor(sapply(1:6, function(i) substring(pred[i], 6)))
    confusionMatrix(pred,real)
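A minimal sketch of the confusion-matrix step in base R, assuming the actual and predicted classes are both available as 0/1 matrices with one column per class (the matrices `act` and `prd` below are invented for illustration). It sidesteps the column-name parsing by converting each row straight to a factor with a fixed set of levels.

```r
labs <- c("A", "B", "C")  # class labels, in column order

# Hypothetical one-hot matrices for six test observations.
act <- rbind(c(1,0,0), c(0,1,0), c(0,0,1), c(1,0,0), c(0,1,0), c(0,0,1))
prd <- rbind(c(1,0,0), c(0,0,1), c(0,0,1), c(1,0,0), c(0,1,0), c(0,1,0))

# Column index of the 1 in each row, mapped to a label.
# Fixing the levels keeps the table square even if a class is never predicted.
real <- factor(labs[max.col(act, ties.method = "first")], levels = labs)
pred <- factor(labs[max.col(prd, ties.method = "first")], levels = labs)

cm <- table(Predicted = pred, Actual = real)
accuracy <- sum(diag(cm)) / sum(cm)
print(cm)
print(accuracy)
```

The same `pred` and `real` factors can be passed to `confusionMatrix(pred, real)` as in the code above, provided the caret package is loaded.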

Biostat44 commented 3 years ago

I would really appreciate your help and a response. Thanks again.

aboulaboul commented 3 years ago

BAK! Sorry, I have a job besides this ;-) I'll have a look today.

aboulaboul commented 3 years ago

I only see 30 observations!!! You won't do much with so little data (even considering data augmentation). I don't really understand what you want to do, but supposing that you want to model/predict the column Groups from the others (not to mention: with more observations!), I would simply do it this way:

    library(automl)
    dataset$Gender <- as.numeric(dataset$Gender)
    dataset$Severity <- as.numeric(dataset$Severity)
    xmat = dataset[,2:4]
    lghlab <- max(as.numeric(dataset$Groups)) # number of classes; as.numeric() also covers a factor column
    ymat <- matrix(seq(from = 1, to = lghlab, by = 1), nrow(xmat), lghlab, byrow = TRUE)
    ymat <- (ymat == as.numeric(dataset$Groups)) + 0
    amlmodel <- automl_train(Xref = xmat, Yref = ymat) # to customize according to the documentation
    yhat <- automl_predict(model = amlmodel, X = xmat) # get the predictions
    # keep only the max of each row: 1 for the highest score, 0 elsewhere
    yhat <- t(apply(yhat, 1, function(x) {i <- which.max(x); x[] <- 0; x[i] <- 1; x}))
    res <- cbind(ymat, yhat) # bind actual with predicted

Biostat44 commented 3 years ago

I'm really sorry for annoying you and for providing incomplete information :(. This data is not what I really want to analyze. With this data, my goal was to build and understand the concept of "feature hashing" more easily (hence the few observations). I am trying to understand how to combine automl and feature hashing, using the code I sent. My goal is to predict the "Group" variable in the dataset. I am very, very grateful for your attention.

aboulaboul commented 3 years ago

OK, so just replace those lines of code

    dataset$Gender <- as.numeric(dataset$Gender)
    dataset$Severity <- as.numeric(dataset$Severity)
    xmat = dataset[,2:4]

by

    MyHashSize <- 2 # make your choice
    xmat <- FeatureHashing::hashed.model.matrix(~., data = dataset[,2:4], hash.size = MyHashSize, transpose = FALSE, create.mapping = FALSE, is.dgCMatrix = TRUE, signed.hash = FALSE, progress = FALSE)

Biostat44 commented 3 years ago

Thank you. With your permission, I would like to ask two last questions. First, should the minimum value of "hash.size" be the total number of categories across the qualitative variables? Second, since the output variable is categorical, is it necessary to apply "FeatureHashing" to it as well? If so, would it be correct to encode the "Group" variable as follows?

    MyHashSize = 3
    ymat <- FeatureHashing::hashed.model.matrix(~., data = dataset[1], hash.size = MyHashSize, transpose = FALSE, create.mapping = FALSE, is.dgCMatrix = TRUE, signed.hash = FALSE, progress = FALSE)

aboulaboul commented 3 years ago

You should consider FeatureHashing for dimensionality reduction and quick data preparation (in one go, ideal for live scoring), but only for inputs. Maybe I'm missing something huge, but for output ...
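To make the role of the hash size concrete, here is a toy sketch of the hashing trick in base R. It is an illustration only: `toy_hash` and `hash_encode` are hypothetical helpers, and the hash (sum of character codes modulo the hash size) is an assumption chosen for readability, not the hash function used by FeatureHashing.

```r
# Toy hash (NOT the one used by FeatureHashing): sum of the character
# codes of the key, mapped to a bucket between 1 and hash_size.
toy_hash <- function(key, hash_size) {
  sum(utf8ToInt(key)) %% hash_size + 1
}

# Encode every "column name + value" pair of a data frame into hash_size columns.
hash_encode <- function(df, hash_size) {
  out <- matrix(0, nrow = nrow(df), ncol = hash_size)
  for (col in names(df)) {
    keys <- paste0(col, as.character(df[[col]]))
    for (i in seq_len(nrow(df))) {
      j <- toy_hash(keys[i], hash_size)
      out[i, j] <- out[i, j] + 1  # colliding features add up in the same bucket
    }
  }
  out
}

# Hypothetical inputs: 2 + 3 = 5 distinct category values in total.
df <- data.frame(Gender = c("F", "M", "F"),
                 Severity = c("low", "high", "mid"),
                 stringsAsFactors = FALSE)

small <- hash_encode(df, hash_size = 2)   # fewer buckets than categories: collisions are unavoidable
large <- hash_encode(df, hash_size = 16)  # more buckets: collisions are less likely, not impossible
print(small)
print(large)
```

The number of output columns is set by the hash size, not by the number of categories. A hash size below the number of distinct category values forces some of them to share a column, which is the dimensionality reduction mentioned above, at the cost of mixing features together.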
