aiparragirre / wlasso

R package to fit LASSO regression models to complex survey data
https://doi.org/10.1002/sta4.578

Error in sample.int(length(x), size, replace, prob) : invalid first argument #2

Open hkim89 opened 10 months ago

hkim89 commented 10 months ago

Hello. I tried to run wlasso on NHANES data with the version you recently updated, but I still get errors:

L1.tr<- wlasso(col.y = "Y",col.x = c(3,6:72), data = wTrain1,cluster = "SDMVPSU",strata = "SDMVSTRA",weights="WTSAF8YR" ,family ="binomial",method = "dCV",k=5,R=10)

The errors found in traceback:

Error in sample.int(length(x), size, replace, prob) : 
invalid first argument
9. sample.int(length(x), size, replace, prob)
8. NextMethod("[")
7.`[.factor`(x, sample.int(length(x), size, replace, prob))
6. x[sample.int(length(x), size, replace, prob)]
5. sample(psu.h, 1)
4. f.folds(data, k = k, seed = seeds[r], strata = strata, cluster = cluster)
3. cv.folds(data, k, weights, seed, strata, cluster, R)
2. replicate.weights(data = data, method = method, cluster = cluster, 
strata = strata, weights = weights, k = k, R = R, B = B, 
train.prob = train.prob, method.split = method.split, seed = seed)
1. wlasso(col.y = "Y", col.x = c(3, 6:72), data = wTrain1, cluster = "SDMVPSU", 
strata = "SDMVSTRA", weights = "WTSAF8YR", family = "binomial", 
method = "dCV", k = 5, R = 10, seed = 100)

or using svydesign

des<- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTSAF8YR, nest=TRUE,data=wTrain1) 
L1<- wlasso(col.y = "Y",col.x = c(3,6:72), design=des ,family ="binomial",method = "dCV",k=5,R=10 )

yields

Error in if (cluster == "1") { : argument is of length zero

I have found where the errors occur in your code, but I still do not understand why they happen. The glmnet function used in wlasso supports factor variables, and this processed NHANES data includes factor variables, FYI. Thanks in advance.

aiparragirre commented 10 months ago

Sorry for the delay in responding. I am trying to figure out where your problems come from. From the error messages I can also identify the part of the code that fails, but I do not yet understand why.

On the one hand, the glmnet call inside wlasso does not handle factor covariates for the moment (we are working on improving the package to resolve this issue). Meanwhile, you may try to use dummy variables instead. So, first of all, could you run the code again, omitting the factor variables, and let me know if you still get errors?

On the other hand, I believe the problems may be related to another part of the code that is not related to factor covariates. Is the dataset you are using publicly available? If not, would it be possible for me to have a "toy-example" of the dataset in order to analyze the error more deeply by myself, in case the errors are not solved when omitting factor variables?

Thank you in advance.

hkim89 commented 10 months ago

I have been going through the code after downloading wlasso from GitHub. I found a minor bug when running wlasso with a svydesign object: the call stored by the survey package contains id, not ids.

# Step 0: Notation
  if(!is.null(design)){ 
    cluster <- as.character(design$call$id[2])
    if(cluster == "1"){
      cluster <- NULL
    }
    strata <- as.character(design$call$strata[2])
    weights <- as.character(design$call$weights[2])
    data <- get(design$call$data)
  }

You may notice that glmnet used to support only matrices as input, but it now works with data frames, including factors, in conjunction with glmnetUtils (https://cran.r-project.org/web/packages/glmnetUtils/vignettes/intro.html or the pdf manual). If you use glmnetUtils together with glmnet in wlasso, wlasso can handle factor variables. I am still tracing the code to find out why the error happens.

NHANES data is public, so you can find it at https://wwwn.cdc.gov/nchs/nhanes/default.aspx, though it needs quite a bit of time to preprocess. I can provide a small test set. The data are the last 2-year cycle of four 2-year cycles, so the weights were recomputed as WTSAF8YR. SEQN, SDDSRVYR, SDMVPSU, and SDMVSTRA are the ids, the survey year (labeled 8 for the 2013-2014 cycle), the pseudo-PSU, and the strata, respectively. The rest are explanatory variables, including factor variables, and Y is the dependent variable, FYI.

Thanks to the clear explanation in your paper and the fully transparent code, I could understand your replicate-weights code for dCV and the other methods, both code-wise and theory-wise. (I have struggled with many recent papers from other quantitative fields, not math, that contain many intentional typos.) I may be able to figure out how to fix it and to port it to Julia. I will let you know when I figure out the error. Thanks for the response.

NHANES13-14.csv

aiparragirre commented 10 months ago

Thank you for sharing a small subset of the NHANES data you are using. It was really helpful to better understand the problems you are having with the package. I incorporated some improvements to the package, which I believe may solve most of the problems you have commented on.

The bugs with the design object were solved by reading the id element of the call rather than ids. Also, reading the documentation of the survey package, I realized that ~0 can also be used to indicate "no cluster", so the functions now consider this option too:

# Step 0: Notation
  if(!is.null(design)){
    cluster <- as.character(design$call$id[2])
    if(cluster == "1" || cluster == "0"){
      cluster <- NULL
    }
    strata <- as.character(design$call$strata[2])
    weights <- as.character(design$call$weights[2])
    data <- get(design$call$data)
  } 
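
A more defensive variant of this step can be sketched as follows. It is only an illustration, not the code in wlasso: the helper name cluster_from_call is hypothetical, and it assumes the cluster formula was written inline in the svydesign() call. It accepts either spelling of the argument (ids or id) and treats any formula without variables, such as ~1 or ~0, as "no cluster".

```r
# Hypothetical helper (not part of wlasso): read the cluster variable from a
# stored svydesign() call, whichever spelling of the argument the user typed.
# Assumes the formula was written inline, e.g. id = ~SDMVPSU.
cluster_from_call <- function(cl) {
  args <- as.list(cl)[-1]            # drop the function name
  f <- args[["ids"]]                 # exact name match on "ids"
  if (is.null(f)) f <- args[["id"]]  # fall back to "id"
  if (is.null(f)) return(NULL)       # no cluster argument at all
  vars <- all.vars(f)                # ~1 and ~0 contain no variables
  if (length(vars) == 0) NULL else vars
}

# quote() only captures the call, so the survey package is not needed here
cluster_from_call(quote(svydesign(id = ~SDMVPSU, strata = ~SDMVSTRA)))
cluster_from_call(quote(svydesign(ids = ~1, strata = ~SDMVSTRA)))
```

Because the result is NULL rather than a zero-length character vector when no cluster is given, a later check such as is.null(cluster) avoids the "argument is of length zero" error reported above.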

The problems with the dCV method were solved by making a small change in the f.folds() function. Specifically, names() was added in line 337 of replicate.weights.R:

  h.onefold <- names(which(h.fold.sum==1))
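
To illustrate what names() changes here, consider the small made-up example below. The surrounding f.folds() code is not shown in this thread, so the vector and its values are assumptions; only the behavior of which() and names() is standard R. On a named vector, which() returns positions, while names(which()) returns the labels, and with strata labels such as 104, 105, 106 the two are not interchangeable.

```r
# Made-up counts per stratum; the names play the role of strata labels
h.fold.sum <- c("104" = 2, "105" = 1, "106" = 3)

pos <- which(h.fold.sum == 1)  # position of the match (with its name attached)
lab <- names(pos)              # the stratum label itself

unname(pos)  # position within the vector
lab          # label to compare against the strata column

# Selecting rows by label works; the position would match no stratum
strata <- c("104", "104", "105", "106", "106", "106")
sum(strata == lab)
sum(strata == unname(pos))
```

If the position were used where a label is expected, the selection would come back empty, and sampling from an empty set is one way to end up with the sample.int() error reported above.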

Please check and let me know if these changes have solved the problems you reported.

Regarding the issue with the factor covariates, I still run into problems even when I use glmnetUtils, so for the moment I have not changed the wlasso() function to incorporate them. However, I keep working on this limitation, and the changes will be incorporated into the package as soon as possible.

Thank you for sharing your troubles with the package and helping to improve it!

hkim89 commented 10 months ago

Thanks for the quick update. With the new version, wlasso reaches the same error both when I use a design object and when I define each component separately. This error happens when factor variables are included in the data.

Error in h(simpleError(msg, call)) : 
error in evaluating the argument 'x' in selecting a method for function 'as.matrix': object of invalid type "character" in 'matrix_as_dense()'
10. h(simpleError(msg, call))
9. .handleSimpleError(function (cond) 
.Internal(C_tryCatchHelper(addr, 1L, cond)), "object of invalid type \"character\" in 'matrix_as_dense()'", 
base::quote(cbind2(1, newx) %*% nbeta))
8. cbind2(1, newx) %*% nbeta
7. cbind2(1, newx) %*% nbeta
6. as.matrix(cbind2(1, newx) %*% nbeta)
5. predict.glmnet(model, newx = as.matrix(newdata[, col.x]), type = "response")
4. NextMethod("predict")
3. predict.lognet(model, newx = as.matrix(newdata[, col.x]), type = "response")
2. predict(model, newx = as.matrix(newdata[, col.x]), type = "response")
1. wlasso(col.y = "Y", col.x = c(3, 6:72), data = wTrain1, cluster = "SDMVPSU", 
strata = "SDMVSTRA", weights = "WTSAF8YR", family = "binomial", 
method = "dCV", R = 10)

When I used sparse.model.matrix to convert the factor variables into one-hot coded variables, taking a hint from the glmnet wrapper in glmnetUtils, wlasso ran without errors. glmnetUtils::glmnet also has an option sparse = TRUE to create a sparse matrix inside the function. You may be able to extend wlasso to accept more flexible input options.

The error above indicates that glmnet was working with factor variables, but the error happens when estimating yhat (the predictions). In fact, glmnetUtils also supports predict with a data.frame as newdata, with alpha=1 for lasso. The problem with as.matrix is that it converts every element of the data frame to character when factors are present, so it is better to use sparse.model.matrix, which one-hot codes the factor variables and is also more memory efficient. I will be able to write a temporary function so that wlasso runs with a sparse matrix, and then try Elastic Net, since thanks to your code I now see how the survey weights act in conjunction with ML, which is what I am working on.
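
The as.matrix behavior described above can be reproduced with a few lines of base R. The data frame below is made up, and base model.matrix() stands in for Matrix::sparse.model.matrix() (which takes the same formula and data arguments but returns a sparse matrix) so the sketch has no dependencies. Note that both functions use treatment (dummy) coding by default, so one reference level per factor is dropped rather than producing a full one-hot encoding.

```r
# Made-up data with one numeric and two factor covariates
df <- data.frame(
  age  = c(34, 51, 47, 29),
  sex  = factor(c("F", "M", "M", "F")),
  race = factor(c("A", "B", "C", "A"))
)

# as.matrix() falls back to a character matrix once a factor is present,
# which is what the predict step above cannot multiply by the coefficients
m_chr <- as.matrix(df)
is.character(m_chr)

# model.matrix() expands factors into numeric dummy columns instead;
# the intercept column is dropped because glmnet adds its own
x <- model.matrix(~ ., data = df)[, -1]
colnames(x)
is.numeric(x)
```

The same design matrix has to be built for both fitting and prediction, so that the columns of newx line up with the fitted coefficients.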

Another question: when using the wlasso function for variable selection, is setting R=1 sufficient, or do you recommend replicating the weights more than once (R>>1), even if 10-fold dCV is run for each R?