bhklab / mRMRe

mRMRe is a package for Parallelized Minimum Redundancy, Maximum Relevance (mRMR) Ensemble Feature Selection

Selecting feature subset from a feature set using mRMRe R package #16

Open abu034004 opened 6 years ago

abu034004 commented 6 years ago

Hi, As suggested by Dr. Benjamin Haibe-Kains in his email reply, I am creating an issue for my query. Please excuse me if the question is simple, as I am new to R. The details are below.

Suppose I have a CSV file, gene.csv (attached as a zip file, gene.zip), with a feature set of 6 attributes ([G1.1.1.1], [G1.1.1.2], [G1.1.1.3], [G1.1.1.4], [G1.1.1.5], [G1.1.1.6]) and a target class variable [Output] ('1' indicates the positive class and '-1' the negative class). Here is the sample gene.csv file (see the attached zip file):

[G1.1.1.1]  [G1.1.1.2]  [G1.1.1.3]  [G1.1.1.4]  [G1.1.1.5]  [G1.1.1.6]  [Output]
11.688312   0.974026    4.87013     7.142857    3.571429    10.064935   -1
12.538226   1.223242    3.669725    6.116208    3.363914    9.174312     1
10.791367   0.719424    6.115108    6.47482     3.597122    10.791367   -1
13.533835   0.37594     6.766917    7.142857    2.631579    10.902256    1
9.737828    2.247191    5.992509    5.992509    2.996255    8.614232    -1
11.864407   0.564972    7.344633    4.519774    3.389831    7.909605    -1
11.931818   0           7.386364    5.113636    3.409091    6.818182     1
16.666667   0.333333    7.333333    4.333333    2           8.333333    -1

I am trying to get the best feature subset of 2 attributes (out of the 6 attributes above) and wrote the following R code.

library(mRMRe)
file_n<-paste0("E:\\gene", ".csv")
df <- read.csv(file_n, header = TRUE)
f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7, 
              feature_count = 2, solution_count = 1)

When I run this code, I get the following error for the statement f_data <- mRMR.data(data = data.frame(df)):

Error in .local(.Object, ...) : 
  data columns must be either of numeric, ordered factor or Surv type

However, the data in every column of the CSV file are real numbers. How can I change the R code to fix this problem? Also, I am not sure what the value of target_indices should be in the statement mRMR.ensemble(data = f_data, target_indices = 7, feature_count = 2, solution_count = 1), as my target class variable is named "[Output]" in the gene.csv file.

I would greatly appreciate your help in obtaining the best feature subset from the gene.csv file using your mRMRe R package.

Thank you very much.

Sincerely, Abu

ba3lwi commented 6 years ago

Hi Abu,

Thank you for your interest in using our package and for reporting this issue.

The problem you are facing comes from how R reads the file and is not specific to the package. read.csv infers a class for each column, and because your output column contains only whole numbers (1 and -1), it is read as class "integer". mRMR.data accepts only numeric, ordered factor or Surv columns, so the integer column triggers the error. Changing that column's class to numeric should do the trick. We will also try to update the package to check for this and perhaps do the conversion implicitly.
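The type inference can be reproduced without the package. The snippet below is a minimal base-R sketch: the inline `csv_text` is a hypothetical two-column stand-in for gene.csv (values copied from the first two rows of the question), not the real file. It also shows how read.csv rewrites the bracketed headers into syntactic names.

```r
# Hypothetical two-column stand-in for gene.csv (not the real file).
csv_text <- "[G1.1.1.1],[Output]\n11.688312,-1\n12.538226,1\n"
df <- read.csv(text = csv_text, header = TRUE)

# read.csv rewrites "[Output]" into a syntactic column name.
names(df)

# The decimal column is inferred as "numeric", the 1/-1 column as "integer".
classes_before <- sapply(df, class)
classes_before

# Convert every column to numeric so that no integer column remains.
df[] <- lapply(df, as.numeric)
classes_after <- sapply(df, class)
classes_after
```

Converting all columns at once is a design choice for this sketch; converting only the output column works equally well when it is the only integer column.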

In short, just add the transform line after the read.csv line. The sapply calls are only there to inspect the column classes. The full script then reads:

library(mRMRe)
df <- read.csv("gene.csv", header = TRUE)

sapply(df, class) # this will show you the classes of all columns in df
df <- transform(df, X.Output. = as.numeric(X.Output.)) # this will change the output column class into "numeric"
sapply(df, class) # to check that the change is in effect

f_data <- mRMR.data(data = data.frame(df))
featureData(f_data)
mRMR.ensemble(data = f_data, target_indices = 7, 
              feature_count = 2, solution_count = 1)

As for the value of target_indices: it is the index of the target column in your df. With a single target it is a single value, and since your output column is the 7th column, target_indices = 7 is correct.
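If you would rather not count columns by hand, the index can be looked up by name. This is a small base-R sketch under two assumptions: the inline header and single data row stand in for gene.csv, and read.csv is left at its default of rewriting "[Output]" to "X.Output." (the name used in the transform line above).

```r
# Hypothetical stand-in for gene.csv: the header plus the first data row.
header <- "[G1.1.1.1],[G1.1.1.2],[G1.1.1.3],[G1.1.1.4],[G1.1.1.5],[G1.1.1.6],[Output]"
row1 <- "11.688312,0.974026,4.87013,7.142857,3.571429,10.064935,-1"
df <- read.csv(text = paste0(header, "\n", row1, "\n"), header = TRUE)

# Position of the target column, looked up by its rewritten name.
target_index <- which(names(df) == "X.Output.")
target_index
```

The resulting `target_index` can then be passed as target_indices instead of a hard-coded number, which keeps the call valid if columns are added or reordered.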

Please let us know if this solves your problem or if you face any other problem.

Best, Wail