bhklab / mRMRe

mRMRe is a package for Parallelized Minimum Redundancy, Maximum Relevance (mRMR) Ensemble Feature Selection

Trouble with my categorical target variable #27

Closed SvdvBui closed 4 years ago

SvdvBui commented 4 years ago

Hello,

I have a data frame "data" with 60 rows (samples) and 20228 columns, where the first column is my target variable (an ordered factor: 0 or 1) and the other columns are my features (numeric). I want to do feature selection with mRMRe inside a loop that runs 5-fold cross-validation, repeated 3 times. Each time I select 25 features. Here is the problematic part of my code:

library(caret)
library(mRMRe)

data <- read.csv("home/RNA_seq.csv", row.names=1, sep=";", stringsAsFactors=FALSE)
data <- data.frame(t(data))
data[,1] <- factor(data[,1])
data[,1] <- ordered(data[,1], levels = c("0", "1"))

features_select <- list()

r <- 5 # 5-cross-validation
t <- 3 # 5-cross-validation done 3 times
  for (j in 1:t){
    for (i in 1:r){
      #5-cross-validation
      train.index <- createFolds(factor(data$Response), k = 5, list = TRUE, returnTrain = TRUE) 
      datatrain <- data[train.index[[i]],]
      datatest  <- data[-train.index[[i]],]

      #Feature selection
      data.mrmre.train <- mRMR.data(data=datatrain)
      res.fs.mrmr <- mRMR.classic(data=data.mrmre.train, target_indices=1, feature_count=25)
      selected.features.mrmre <- mRMRe::solutions(res.fs.mrmr)
      features_select[[((j-1)*r+i)]] <- res.fs.mrmr@feature_names[unlist(selected.features.mrmre)]
      print(features_select[[((j-1)*r+i)]])
      print(res.fs.mrmr)
    }
  }

My problem is that my target variable, "Response" (column 1 of "data"), is sometimes selected by mRMRe as a feature. When this happens, "Response" fills every position from feature 2 up to the number requested (here 25). For example:

features_select :

[[1]]
[1] "AC137800.2" "AC007387.1" "AC079354.1" "AC145138.1" "RNA5SP370" 
[6] "RNA5SP219"  "AL022324.1" "AC023449.1" "AP000873.1" "AC020612.2"
[11] "RNA5SP473"  "AC092810.1" "IGKV1D.37"  "SST"        "AC093331.1"
[16] "TRAJ34"     "AC107983.1" "RPL39P"     "HSBP1P1"    "TRBJ1.6"   
[21] "PHGR1"      "RNA5SP435"  "RNA5SP301"  "AC005255.1" "KRT127P"

[[2]]
 [1] "AC073869.8"   "Response" "Response" "Response" "Response" "Response"
 [7] "Response" "Response" "Response" "Response" "Response" "Response"
[13] "Response" "Response" "Response" "Response" "Response" "Response"
[19] "Response" "Response" "Response" "Response" "Response" "Response"
[25] "Response"

Here is the output of the function mRMR.classic() corresponding to the two feature sets above.

[[1]]
Formal class 'mRMRe.Filter' [package "mRMRe"] with 8 slots
  ..@ filters       :List of 1
  .. ..$ 1: int [1:25, 1] 18837 18781 15503 15526 17437 20028 18924 17133 17024 16104 ...
  ..@ scores        :List of 1
  .. ..$ 1: num [1:25, 1] 0.817 0.819 0.817 0.817 0.817 ...
  ..@ mi_matrix     : num [1:20228, 1:20228] NA -0.3786 -0.1536 -0.0929 -0.0964 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ causality_list:List of 1
  .. ..$ 1: num [1:20228] NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ..@ sample_names  : chr [1:48] "Pt1_28" "Pt2_28" "Pt4_28" "Pt5_28" ...
  ..@ feature_names : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ target_indices: int 1
  ..@ levels        : int [1:25] 1 1 1 1 1 1 1 1 1 1 ...

[[2]]
Formal class 'mRMRe.Filter' [package "mRMRe"] with 8 slots
  ..@ filters       :List of 1
  .. ..$ 1: int [1:25, 1] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ scores        :List of 1
  .. ..$ 1: num [1:25, 1] 0 0 0 0 0 0 0 0 0 0 ...
  ..@ mi_matrix     : num [1:20228, 1:20228] NA -0.518 -0.246 -0.211 -0.204 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  .. .. ..$ : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ causality_list:List of 1
  .. ..$ 1: num [1:20228] NA NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
  ..@ sample_names  : chr [1:48] "Pt1_28" "Pt2_28" "Pt4_28" "Pt5_28" ...
  ..@ feature_names : chr [1:20228] "Response" "TMSB15B" "MATR3" "HSPA14" ...
  ..@ target_indices: int 1
  ..@ levels        : int [1:25] 1 1 1 1 1 1 1 1 1 1 ...

This doesn't happen at the same values of i and j every time I run the loop. Do you have any idea where the problem is?
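
A quick way to see which iterations are affected is to scan features_select for the target name. The sketch below is base R only and uses a shortened, two-element stand-in for the list printed above (three names per entry instead of 25). The names has_target and affected are introduced here for illustration.

```r
# Shortened stand-in for the features_select list shown above.
features_select <- list(
  c("AC137800.2", "AC007387.1", "AC079354.1"),
  c("AC073869.8", "Response", "Response")
)

# TRUE for every iteration whose selected set contains the target itself.
has_target <- vapply(features_select, function(f) "Response" %in% f, logical(1))

# Positions (j-1)*r+i of the affected iterations.
affected <- which(has_target)
print(affected)
```

Run against the full list, this would give the loop positions where "Response" leaked into the selection, which can then be compared across runs.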

Thank you in advance !

ChristopherEeles commented 4 years ago

I reviewed your code and believe I have found the issue. If you consult the constructor documentation with ?mRMR.data you will see that:

data: is expected to be a data frame with samples and features respectively organized as rows and columns. The columns have to be of type :numeric, ordered factor, Surv and respectively interpreted as :continuous, discrete and survival variables.

Given that Response is being passed as a column to the constructor, it interprets this to mean the response is a categorical predictor. Instead this column should be passed to the strata parameter, which:

is expected to be a vector of type :ordered factor with the strata associated to the samples provided in data.

Please try this updated code and let me know here whether it resolves the issue:

library(caret)
library(mRMRe)

data <- read.csv("home/RNA_seq.csv", row.names=1, sep=";", stringsAsFactors=FALSE)
data <- data.frame(t(data))
data[,1] <- factor(data[,1])
data[,1] <- ordered(data[,1], levels = c("0", "1"))

features_select <- list()

r <- 5 # 5-cross-validation
t <- 3 # 5-cross-validation done 3 times
for (j in 1:t){
  for (i in 1:r){
    #5-cross-validation
    train.index <- createFolds(factor(data$Response), k = 5, list = TRUE, returnTrain = TRUE) 
    datatrain <- data[train.index[[i]],]
    datatest  <- data[-train.index[[i]],]

    #Feature selection
    data.mrmre.train <- mRMR.data(data=datatrain[,-1], strata = datatrain[,1])
    res.fs.mrmr <- mRMR.classic(data=data.mrmre.train, target_indices=1, feature_count=25)
    selected.features.mrmre <- mRMRe::solutions(res.fs.mrmr)
    features_select[[((j-1)*r+i)]] <- res.fs.mrmr@feature_names[unlist(selected.features.mrmre)]
    print(features_select[[((j-1)*r+i)]])
    print(res.fs.mrmr)
  }
}

SvdvBui commented 4 years ago

Thank you so much! That works now! Sorry for bothering you with my stupid question, but I couldn't find the solution by myself. Thanks again!

kepaabdel commented 3 years ago

Hello, I have a question about your code. I am doing feature selection and I really like your idea of using cross-validation, but I do not see where the test set is used in the code. Could you please explain the purpose of splitting the data into training and test sets? Thank you very much.

GliozzoJ commented 2 years ago

Hi there! I am trying to use the mRMR.ensemble function with the bootstrap method for feature selection. I stumbled on this problem, where my dependent variable (set using the argument target_indices) is selected as a solution at every bootstrap. Strangely, it is always selected last and repeated until the number of features requested through feature_count is reached. I thought that I was creating the data object in the wrong way, so I tried the above solution, which consists of removing the dependent variable from the argument "data" and passing it as "strata" to the function mRMR.data(). However, I do not see the sense of calling mRMR.ensemble (or mRMR.classic as above) with the option target_indices=1, since in this way the target is no longer my dependent variable but the first independent variable in the dataset. This is because in the above-mentioned solution:

data.mrmre.train <- mRMR.data(data=datatrain[,-1], strata = datatrain[,1])

the argument data does not actually contain the dependent variable of interest anymore. In this sense, mRMR is selecting the set of variables that has maximum relevance and minimum redundancy with respect to an independent variable.

Am I missing something?

Thank you in advance :)
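
The index shift described here can be reproduced without mRMRe. The base-R sketch below uses a made-up four-sample data frame with the same column layout as in the thread (Response first, then features; all values are invented). It assumes, as the feature_names slot printed earlier suggests, that mRMR.data keeps features in data frame column order, so that target_indices counts columns of whatever was passed as data.

```r
# Toy data frame with the thread's layout: response first, then features.
# All values are invented for illustration.
datatrain <- data.frame(
  Response = ordered(c("0", "1", "0", "1"), levels = c("0", "1")),
  TMSB15B  = c(1.2, 0.4, 2.2, 0.9),
  MATR3    = c(5.1, 4.8, 5.6, 4.9)
)

# With the response kept in the data, index 1 is the response.
print(colnames(datatrain)[1])

# After dropping column 1, index 1 is the first feature instead.
features_only <- datatrain[, -1]
print(colnames(features_only)[1])

# Looking the target up by name makes the intended column explicit.
target_index <- match("Response", colnames(datatrain))
print(target_index)

# In the features-only frame the response cannot be found at all.
print(match("Response", colnames(features_only)))
```

If the column-order assumption holds, target_indices=1 on the features-only object would refer to TMSB15B here, not to Response. Looking the position up by name at least makes it explicit which column is being targeted. Whether this avoids the repeated-target output is not settled in this thread.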

DaveEvenden commented 3 months ago

I am a bit puzzled by this solution. I have two queries, although the second may be a style thingy.

1) If I understand correctly, in mRMR.data(data=datatrain[,-1], strata = datatrain[,1]), the response variable is assigned to the strata slot rather than being part of the feature_names slot as in the OP's code. However, in mRMR.classic(data=data.mrmre.train, target_indices=1, feature_count=25), target_indices=1 would surely not point to the required response, but to one of the other features. Is that right? How should target_indices be used when the response is in the strata slot and no longer in the feature_names slot?

2) Both the OP's code and the solution code have createFolds in the inner loop. Doesn't that mean the folds are created anew in each inner-loop iteration, so that the test folds within a repetition are no longer guaranteed to be unique/non-overlapping? (This might be what's intended...) It seems to me that having createFolds in the outer loop overcomes this. Even better(?), call createMultiFolds outside the loops and access its result within them. Just a thought.

Thanks, Dave
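
The fold-placement point can be sketched as follows, with createFolds called once per repetition instead of once per inner iteration. The response vector is a balanced toy stand-in for data$Response, the seed is arbitrary, and the names n_folds, n_repeats and test_sets are introduced here for illustration.

```r
library(caret)

set.seed(42)  # arbitrary seed so the folds are reproducible

# Toy stand-in for data$Response: 60 samples, two balanced classes.
response <- factor(rep(c("0", "1"), each = 30))

n_folds   <- 5  # folds per repetition
n_repeats <- 3  # number of repetitions

test_sets <- list()
for (j in 1:n_repeats) {
  # One partition per repetition, created before the inner loop starts.
  train.index <- createFolds(response, k = n_folds, list = TRUE, returnTrain = TRUE)
  for (i in 1:n_folds) {
    # The held-out samples are those not in the i-th training set.
    test.idx <- setdiff(seq_along(response), train.index[[i]])
    test_sets[[(j - 1) * n_folds + i]] <- test.idx
  }
}

# Within a repetition, the test folds are disjoint and cover every sample once.
rep1 <- unlist(test_sets[1:n_folds])
print(anyDuplicated(rep1) == 0)
print(setequal(rep1, seq_along(response)))
```

caret's createMultiFolds(y, k, times) returns the training indices for all repetitions in a single list, which would allow the folds to be defined entirely outside the loops, as suggested.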