Open asheetal opened 3 years ago
Hi Asheetal,
Our package mixgb can run with GPU support if your machine has NVIDIA GPUs. If you have already installed the R package xgboost with GPU support, you can set the parameter `tree_method = "gpu_hist"` in the mixgb imputer to substantially speed up the imputation process.
I have added a section in the readme.md file regarding GPU support. Hope that it helps.
Also, we found that mixgb (even without GPU support) tends to be considerably faster than missRanger, especially for continuous data and when the number of observations is large. However, the advantage may be less obvious for high-dimensional categorical data with many classes. We haven't run any experiments on high-dimensional datasets yet and would be glad to see how mixgb works for your datasets.
I used these commands:

```r
MIXGB <- Mixgb.train$new(df.train, tree_method = "gpu_hist", gpu_id = 3)
mixgb.obj <- MIXGB$impute(m = 5)
```
I get this error:

```
Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
  Invalid Parameter format for gpu_id expect int but value='NA'
```

Not adding `gpu_id` generates the same error.
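For what it's worth, the error message suggests that an unset `gpu_id` (defaulting to `NA` in R) is being serialized and handed to the XGBoost backend as the literal string `"NA"`. A minimal sketch of the mechanism, assuming the parameter is stringified before being passed down:

```r
# If an R parameter defaults to NA and is converted to character before
# reaching the XGBoost C++ backend, the backend receives the text "NA"
# instead of an integer -- hence: "expect int but value='NA'".
gpu_id <- NA
as.character(gpu_id)
# [1] "NA"
```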
Hi Asheetal,
Sorry, I hadn't pushed the latest version to GitHub, but I have just done so. Please reinstall the package and try the following example to see if it works.
```r
devtools::install_github("agnesdeng/mixgb")
library(mixgb)

set.seed(2021)
n <- nrow(iris)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)
train.df <- iris[idx, ]
test.df <- iris[-idx, ]
trainNA.df <- createNA(train.df, p = 0.3)
testNA.df <- createNA(test.df, p = 0.3)

## without specifying a GPU id
MIXGB <- Mixgb.train$new(trainNA.df, tree_method = "gpu_hist")
mixgb.obj <- MIXGB$impute(m = 5)

## specifying a GPU id
MIXGB <- Mixgb.train$new(trainNA.df, tree_method = "gpu_hist", gpu_id = 3)
mixgb.obj <- MIXGB$impute(m = 5)
test.impute <- impute.new(object = mixgb.obj, newdata = testNA.df)
test.impute
```
I have no problem running this. However, since my machine doesn't have multiple GPUs, when I set `gpu_id = 3` I get the warning message: `WARNING: ../src/learner.cc:231: Only 1 GPUs are visible, setting gpu_id to 0`.
Thanks. At least the first step now runs for me.
```r
mixgb.obj <- MIXGB$impute(m = 5)
```

A few observations:

```
Error in xgb.DMatrix(newdata, missing = missing) :
  xgb.DMatrix does not support construction from double.
```

```r
df.test.imputed <- impute.new(object = mixgb.obj, newdata = df.test)
```

```
Error in validObject(.Object) :
  invalid class “dgCMatrix” object: lengths of slots 'i' and 'x' must match
```
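For context on the second error: a dgCMatrix stores one row index (slot `i`) per stored value (slot `x`), and the Matrix package's validity check rejects any object where those lengths differ. A small sketch of a well-formed dgCMatrix (the Matrix package ships with R):

```r
library(Matrix)

# Build a 2x2 sparse matrix in compressed-column (dgCMatrix) form:
# entries at (1,1)=1, (2,1)=2, (1,2)=3.
m <- sparseMatrix(i = c(1, 2, 1), j = c(1, 1, 2), x = c(1, 2, 3))

class(m)                    # "dgCMatrix"
length(m@i) == length(m@x)  # TRUE: one row index per stored value
```

The error above indicates that somewhere a dgCMatrix was constructed with those slots out of sync, e.g. from malformed input.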
When using `Mixgb$new(data = ...)` or `Mixgb.train$new(data = ...)`, `data` should be a dataframe, not a dgCMatrix. Did you feed in an xgb.DMatrix object instead? Our package will automatically convert the dataframe for you.
```r
library(mixgb)
set.seed(2021)

# iris is a dataframe
n <- nrow(iris)
idx <- sample(1:n, size = round(0.7 * n), replace = FALSE)

# train.df and test.df are also dataframes
train.df <- iris[idx, ]
test.df <- iris[-idx, ]

# create some missing values
trainNA.df <- createNA(train.df, p = 0.3)
testNA.df <- createNA(test.df, p = 0.3)

# the training data we feed in is still a dataframe (trainNA.df);
# users do not need to convert it to a dgCMatrix themselves
MIXGB <- Mixgb.train$new(trainNA.df, tree_method = "gpu_hist")
```
As for speed: if your dataset has fewer observations but a large number of multiclass categorical variables (e.g., one categorical variable with 100 classes), missRanger seems to run faster, but the imputation quality in that setting hasn't been investigated yet.
It is a dataframe. I usually dummy-code before imputation. Sorry, I spoke too soon earlier. Different servers have the same problem. Maybe I can share the rds file later if that helps.
Sharing a list object that has both the test and train dataframes: https://www.dropbox.com/s/hl3y47wpnxsaghc/for_github.rds?dl=0
Hi Asheetal,
Our package doesn't require dummy coding before imputation, since we convert categorical variables automatically. The variables in the dataframe should be numeric, binary, or multiclass categorical.
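To illustrate (a minimal sketch with made-up column names): the expected input is an ordinary dataframe whose categorical columns are factors, with no `model.matrix()`-style dummy columns:

```r
# A dataframe in the form described above: numeric, binary, and
# multiclass columns as-is, missing values as NA.
df <- data.frame(
  age    = c(25, 40, NA),                            # numeric
  smoker = factor(c("yes", "no", NA)),               # binary factor
  city   = factor(c("Auckland", "Sydney", "Perth"))  # multiclass factor
)
str(df)  # feed this in directly; no dummy coding needed
```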
Thanks, let me know if the above dropbox link works. Curious if this data works at your end.
Does this dataframe work at your end? Wondering if similar newdata feature is on roadmap for autoencoder?
Yes, the dropbox link works. There are two data frames in the list. I tried one of them. It worked when I trained on the whole dataset, but it showed errors when I used Mixgb.train on the 70% training data. It turns out that some variables in the subsetted data have only one missing value, and XGBoost does not allow predicting on a vector (it requires matrix form).
I have fixed this, and it should now work in that kind of scenario. I tried one of your data frames and it worked too. Feel free to reinstall mixgb and have a try. Let me know if there are more errors. Your feedback is much appreciated.
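The underlying pitfall here is R's dimension dropping: subsetting a matrix down to a single row yields a plain vector, which XGBoost's predict step rejects. A sketch of the behaviour, and the usual `drop = FALSE` fix:

```r
m <- matrix(1:6, nrow = 3)

is.matrix(m[1, ])                # FALSE: a single row drops to a vector
is.matrix(m[1, , drop = FALSE])  # TRUE: stays a one-row matrix
```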
I noticed that it does take quite a long time to impute your dataset, as one categorical variable has 40 levels. For this type of dataset it would be much faster to impute with an autoencoder; however, we are still investigating the imputation performance of autoencoders.
> Does this dataframe work at your end? Wondering if similar newdata feature is on roadmap for autoencoder?
Yes, imputing new data will be added to the autoencoder imputers soon. Currently I am migrating my code from TensorFlow 1 to TensorFlow 2. After that, I should be able to add more features to the autoencoder imputer in the package misle. I will let you know when it's done.
Thanks. I get a warning that is odd: firstly, there is no such folder on my system; secondly, there is no XGBoost 1.3 on the system. My system has XGBoost 1.5.dev. Is this some fixed message from mixgb? Will test newdata and report soon.

```
WARNING: /home/asheetal/Downloads/xgboost/src/learner.cc:1094: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
```
This is a warning message from XGBoost, not from mixgb. It mainly tells users that, starting from XGBoost version 1.3.0, the default evaluation metric for `binary:logistic` has changed. Users who prefer the old default can set it manually.
I may try to suppress this warning in a later update of mixgb.
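As the warning itself suggests, naming the metric explicitly silences it. A hedged sketch in terms of plain xgboost parameters (whether and how mixgb forwards these to xgboost is an assumption here):

```r
# In plain xgboost, explicitly setting eval_metric avoids the
# "default metric changed" warning emitted since 1.3.0:
params <- list(
  objective   = "binary:logistic",
  eval_metric = "logloss"  # or "error" to restore the pre-1.3.0 default
)
```

Alternatively, the call can be wrapped in `suppressWarnings()` on the R side.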
I do machine-learning research using large datasets, and missRanger was my go-to package earlier; even so, it took 1-2 months to impute. Really excited to see an alternative. A curiosity about xgboost: does mixgb use the installed xgboost.so file, or does it recompile the xgboost binary? In other words, can it handle the CUDA libraries? That would be really useful to me in terms of speedup. I have a precompiled xgboost.so that uses my multi-GPU system setup.
Specifically, support for the parameters `tree_method` and `gpu_id` in your package.
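One quick user-level way to check whether the installed xgboost build can actually use CUDA is to attempt a tiny `gpu_hist` training run (a sketch, not mixgb internals; on a CPU-only build this simply raises an error):

```r
library(xgboost)

# Tiny probe: train one boosting round with tree_method = "gpu_hist".
# If the xgboost shared library was built without CUDA, this errors out;
# with a GPU-enabled build it runs.
X <- matrix(rnorm(40), nrow = 10)
y <- rbinom(10, 1, 0.5)
res <- tryCatch(
  xgboost(data = X, label = y, nrounds = 1,
          tree_method = "gpu_hist", verbose = 0),
  error = function(e) e
)
if (inherits(res, "error")) "CPU-only build" else "GPU-enabled build"
```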