agnesdeng / mixgb

mixgb: multiple imputation through XGBoost
https://agnesdeng.github.io/mixgb/
GNU General Public License v3.0

Thank you, thank you!!! #1

Open asheetal opened 3 years ago

asheetal commented 3 years ago

I do machine learning research using large datasets, and missRanger was my go-to package earlier; however, even that took 1-2 months to impute. Really excited to see an alternative. A curiosity about xgboost: does this use the xgboost.so file, or does it recompile the xgboost binary? In other words, can it handle the CUDA libraries? That would really be useful to me in terms of speedup. I have a precompiled xgboost.so that uses a multi-GPU system setup.

Specifically, I'm asking about support for the parameters tree_method and gpu_id in your package.

agnesdeng commented 3 years ago

Hi Asheetal,

Our package mixgb can be run with GPU support if your machine has NVIDIA GPUs. If you have already installed the R package xgboost with GPU support, you can set the parameter tree_method='gpu_hist' in the mixgb imputer to speed up the imputation process considerably.
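
For example, something along these lines (a minimal sketch; it assumes your R xgboost build was compiled with GPU support):

MIXGB <- Mixgb.train$new(df.train, tree_method = "gpu_hist")
mixgb.obj <- MIXGB$impute(m = 5)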

I have added a section in the readme.md file regarding GPU support. Hope that it helps.

agnesdeng commented 3 years ago

Also, we found that mixgb (even without GPU support) tends to be noticeably faster than missRanger, especially for continuous data and when the number of observations is large. However, the advantage may be less obvious for high-dimensional categorical data with many classes. We haven't run any experiments on high-dimensional datasets yet and would be glad to see how mixgb works for your datasets.
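
If you'd like to run a quick timing check yourself, a rough sketch would be something like this (mixgb's createNA helper appears in the examples further down this thread; missRanger is called with its defaults, and m = 1 is assumed to give a like-for-like single imputation):

library(mixgb)
library(missRanger)

irisNA <- createNA(iris, p = 0.3)               # add 30% missing values
system.time(missRanger(irisNA))                 # single imputation with missRanger
MIXGB <- Mixgb.train$new(irisNA)
system.time(mixgb.obj <- MIXGB$impute(m = 1))   # single imputation with mixgb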

asheetal commented 3 years ago

I used this command

MIXGB <- Mixgb.train$new(df.train, tree_method='gpu_hist', gpu_id = 3)
mixgb.obj <- MIXGB$impute(m=5)

I get this error

Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) : 
  Invalid Parameter format for gpu_id expect int but value='NA'

Not passing gpu_id generates the same error.

agnesdeng commented 3 years ago

Hi Asheetal,

Sorry, I hadn't pushed the latest version to GitHub, but I just did. Please reinstall the package and check whether the following example works for you.

devtools::install_github("agnesdeng/mixgb")

library(mixgb)

set.seed(2021)
n <- nrow(iris)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)

train.df <- iris[idx, ]
test.df <- iris[-idx, ]

trainNA.df <- createNA(train.df, p = 0.3)
testNA.df <- createNA(test.df, p = 0.3)

# GPU-accelerated imputation, letting XGBoost choose the default GPU
MIXGB <- Mixgb.train$new(trainNA.df, tree_method = "gpu_hist")
mixgb.obj <- MIXGB$impute(m = 5)

# the same, but requesting a specific GPU
MIXGB <- Mixgb.train$new(trainNA.df, tree_method = "gpu_hist", gpu_id = 3)
mixgb.obj <- MIXGB$impute(m = 5)

test.impute <- impute.new(object = mixgb.obj, newdata = testNA.df)
test.impute

I have no problem running this. However, since my machine doesn't have multiple GPUs, when I set gpu_id=3 I get the warning message "WARNING: ../src/learner.cc:231: Only 1 GPUs are visible, setting gpu_id to 0".

asheetal commented 3 years ago

Thanks. At least the first step runs for me now.

mixgb.obj <- MIXGB$impute(m=5)

A few observations:

  1. I get an error when there is a column with decimal numbers:
     Error in xgb.DMatrix(newdata, missing = missing) : 
     xgb.DMatrix does not support construction from double.
  2. I get an error when I try to impute the test set:
     > df.test.imputed <- impute.new(object = mixgb.obj, newdata = df.test)
     Error in validObject(.Object) : 
     invalid class “dgCMatrix” object: lengths of slots 'i' and 'x' must match
  3. It feels like it took longer than missRanger, but I can only be sure once I have tweaked all the parameters and run some benchmarks.

agnesdeng commented 3 years ago

When using Mixgb$new(data=...) or Mixgb.train$new(data=...), data should be a dataframe, not a dgCMatrix. Did you feed in an xgb.DMatrix object instead?

Our package will automatically convert it for you.

library(mixgb)

set.seed(2021)

# iris is a dataframe
n <- nrow(iris)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)

# train.df and test.df are also dataframes
train.df <- iris[idx, ]
test.df <- iris[-idx, ]

# we create some missing values
trainNA.df <- createNA(train.df, p = 0.3)
testNA.df <- createNA(test.df, p = 0.3)

# we feed in the training data as a dataframe (trainNA.df); users do not need
# to convert it to a dgCMatrix themselves
MIXGB <- Mixgb.train$new(trainNA.df, tree_method = "gpu_hist")

agnesdeng commented 3 years ago

As for the speed: if your dataset has fewer observations but multiclass categorical variables with many classes (e.g., one categorical variable with 100 classes), missRanger seems to run faster, though the imputation quality hasn't been investigated yet.

asheetal commented 3 years ago

It is a dataframe. I usually dummy-code before imputation. Sorry, I spoke too soon earlier. Different servers have the same problem. Maybe I can share the .rds file later if that helps.

asheetal commented 3 years ago

Sharing a list object that has both the test and train dataframes: https://www.dropbox.com/s/hl3y47wpnxsaghc/for_github.rds?dl=0

agnesdeng commented 3 years ago

Hi Asheetal,

Our package doesn't require dummy coding before imputation, since it converts categorical variables automatically. The variables in the dataframe should be numeric, binary, or multiclass categorical.
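
For example (a small sketch; iris already mixes numeric columns with a factor, so no dummy coding is needed):

library(mixgb)

irisNA <- createNA(iris, p = 0.3)   # Species stays a factor
MIXGB <- Mixgb.train$new(irisNA)
mixgb.obj <- MIXGB$impute(m = 5)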

asheetal commented 3 years ago

Thanks, let me know if the above dropbox link works. Curious if this data works at your end.

asheetal commented 3 years ago

Does this dataframe work at your end? Also, is a similar newdata feature on the roadmap for the autoencoder?

agnesdeng commented 3 years ago

Yes, the Dropbox link works. There are two data frames in the list. I tried one of them. It worked when I trained on the whole dataset, but it showed errors when I used Mixgb.train on 70% of the data as a training set. It turns out that some variables in the subsetted data have only one missing value, and XGBoost does not allow predicting on a plain vector (it requires matrix form).
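
For anyone curious, the underlying R gotcha looks like this (an illustrative sketch, not mixgb's exact internals):

m <- as.matrix(iris[, 1:4])
x <- m[1, ]                 # drops to a plain numeric vector; XGBoost's predict() rejects it
x <- m[1, , drop = FALSE]   # stays a 1 x 4 matrix, which predict() accepts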

I have fixed this, and it should now work in this kind of scenario. I tried one of your data frames and it worked too. Feel free to reinstall mixgb and have a try, and let me know if there are more errors. Your feedback is much appreciated.

I noticed that it does take quite a long time to impute your dataset, as one categorical variable has 40 levels. For this type of dataset, it would be much faster to impute with an autoencoder. However, we are still investigating the imputation performance of autoencoders.

agnesdeng commented 3 years ago

> Does this dataframe work at your end? Also, is a similar newdata feature on the roadmap for the autoencoder?

Yes, imputing new data will be added to the autoencoder imputers soon. Currently, I am working on migrating my code from TensorFlow 1 to TensorFlow 2. After that, I should be able to add more features to the autoencoder imputer in the package misle. I will let you know when it's done.

asheetal commented 3 years ago

Thanks... I get a warning that is odd. Firstly, there is no such folder on my machine; secondly, there is no XGBoost 1.3 on the system. My system has XGBoost 1.5.dev. Is this some fixed message from mixgb? I will test newdata and report soon.

WARNING: /home/asheetal/Downloads/xgboost/src/learner.cc:1094: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.

agnesdeng commented 3 years ago

This is a warning message from XGBoost, not from mixgb. It mainly tells users that, starting from XGBoost version 1.3.0, the default evaluation metric for 'binary:logistic' changed from 'error' to 'logloss'. If users prefer the old default, they can set eval_metric manually.
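
In plain xgboost, restoring the old default looks something like this (a sketch using xgboost's bundled agaricus data; whether mixgb passes eval_metric through is a separate question):

library(xgboost)
data(agaricus.train, package = "xgboost")
bst <- xgboost(data = agaricus.train$data, label = agaricus.train$label,
               objective = "binary:logistic",
               eval_metric = "error",   # restore the pre-1.3.0 default
               nrounds = 2, verbose = 0)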

I may try to suppress this warning in a later update of mixgb.