datashield / dsBaseClient

DataSHIELD client side base functions
https://www.datashield.org/
GNU General Public License v3.0

ds.glm() bug: "no input has determined the number of cases" #439

Closed arrjon closed 2 years ago

arrjon commented 2 years ago

I have the data x and y, and I want to regress log_y on x. I computed log_y with ds.log(), but when I then call ds.glm() I get the following error:

$study1 [1] "Command 'glmDS2(log_y ~ x, \"gaussian\", \"0,0\", NULL, NULL, \n \"D\")' failed on 'study1': Error while evaluating 'dsBase::glmDS2(log_y ~ x, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n no input has determined the number of cases\n"

This error does not occur if I directly regress y on x. So what is the problem here?

EDIT: The problem only occurs if one wants to regress log_y on log_x, so both variables have to be precomputed with ds.log(). If one variable comes directly from the datasource, there is no problem.

davraam commented 2 years ago

Does the y variable include only positive values? If y has any negative values or zeros, then log(y) will produce NaNs and -Inf, and glm will return an error.
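To make the point about non-positive values concrete, here is a minimal local sketch in plain base R (no DataSHIELD server involved). The vector y is made-up illustration data, not taken from CNSIM.

```r
# Made-up values: two positive entries, one zero, one negative.
y <- c(2.5, 1, 0, -3)

# log() of a negative number emits a warning and yields NaN; log(0) yields -Inf.
log_y <- suppressWarnings(log(y))
print(log_y)

# Flag which entries a regression could actually use.
usable <- is.finite(log_y)
print(usable)
```

If every entry is flagged as usable, non-positive values can be ruled out as the cause.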

arrjon commented 2 years ago

Yes, the variable has only positive values. Even when I just assign y to a new object with ds.assign(), say y2, the problem occurs when calling ds.glm(). Checking the variable with ds.isNA() (FALSE) and ds.isValid() (TRUE) also gives no indication of why the regression should fail.

But, I have to correct my first statement: The problem only occurs if one wants to regress log_y on log_x, so both variables have to be precomputed with ds.log(). If one variable comes directly from the datasource, there is no problem.

arrjon commented 2 years ago

Here is a full reproducible example using the "CNSIM" test data sets:

require('DSI')
require('DSOpal')
require('dsBaseClient')
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1", 
                  url = "http://192.168.56.100:8080/", 
                  user = "administrator", password = "datashield_test&", 
                  table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2", 
                  url = "http://192.168.56.100:8080/", 
                  user = "administrator", password = "datashield_test&", 
                  table = "CNSIM.CNSIM2", driver = "OpalDriver")
builder$append(server = "study3",
                  url = "http://192.168.56.100:8080/", 
                  user = "administrator", password = "datashield_test&", 
                  table = "CNSIM.CNSIM3", driver = "OpalDriver")
logindata <- builder$build()
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D") 

ds.log("D$LAB_TSC", newobj="LOG_LAB_TSC", datasources=connections)
ds.log("D$LAB_HDL", newobj="LOG_LAB_HDL", datasources=connections)

mod <- ds.glm(formula = "LOG_LAB_TSC ~ LOG_LAB_HDL",
               data = "D",
               family = "gaussian",
               datasources = connections)
datashield.errors()

And the error is:

$study1
[1] "Command 'glmDS2(LOG_LAB_TSC ~ LOG_LAB_HDL, \"gaussian\", \"0,0\", \n    NULL, NULL, \"D\")' failed on 'study1': Error while evaluating 'dsBase::glmDS2(LOG_LAB_TSC ~LOG_LAB_HDL, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n  no input has determined the number of cases\n"

$study2
[1] "Command 'glmDS2(LOG_LAB_TSC ~ LOG_LAB_HDL, \"gaussian\", \"0,0\", \n    NULL, NULL, \"D\")' failed on 'study2': Error while evaluating 'dsBase::glmDS2(LOG_LAB_TSC ~LOG_LAB_HDL, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n  no input has determined the number of cases\n"

$study3
[1] "Command 'glmDS2(LOG_LAB_TSC ~ LOG_LAB_HDL, \"gaussian\", \"0,0\", \n    NULL, NULL, \"D\")' failed on 'study3': Error while evaluating 'dsBase::glmDS2(LOG_LAB_TSC ~LOG_LAB_HDL, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n  no input has determined the number of cases\n"

davraam commented 2 years ago

Thanks for sharing the code.

You get this error because all the assign functions in DataSHIELD create new objects as separate objects, not as columns of an input dataframe. LOG_LAB_TSC and LOG_LAB_HDL therefore exist next to D, not inside it.
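The following is a local base-R analogy of that situation, not the actual glmDS2 code. The data values are made up, and the lookup step (keeping only the model variables that are columns of D) is a hypothetical stand-in for whatever the server-side function does. It shows that a data frame with no matching columns makes stats::complete.cases() fail.

```r
# Local stand-in for the server-side table D (made-up values).
D <- data.frame(LAB_TSC = c(4.2, 5.1, 6.3), LAB_HDL = c(1.1, 1.6, 1.4))

# Derived vectors created as separate objects, outside D.
LOG_LAB_TSC <- log(D$LAB_TSC)
LOG_LAB_HDL <- log(D$LAB_HDL)

# Hypothetical lookup: keep only the model variables that are columns of D.
model_vars <- c("LOG_LAB_TSC", "LOG_LAB_HDL")
all.data <- D[, intersect(model_vars, names(D)), drop = FALSE]

# Neither derived variable is a column of D, so nothing is selected.
print(ncol(all.data))

# complete.cases() cannot determine the number of cases from zero columns.
err <- tryCatch(stats::complete.cases(all.data),
                error = function(e) conditionMessage(e))
print(err)
```

This would also be consistent with the earlier observation that the call succeeds as soon as one variable comes directly from the datasource, since the selected data then has at least one column.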

Therefore, if you don't specify the name of a dataframe in the ds.glm() call, it should work:

mod <- ds.glm(formula = "LOG_LAB_TSC ~ LOG_LAB_HDL",
              family = "gaussian",
              datasources = connections)

Alternatively, you can add the two created vectors as columns of dataframe D (using the ds.dataFrame() function), and then the following should also work:

ds.dataFrame(x = c("D", "LOG_LAB_TSC", "LOG_LAB_HDL"), newobj = "D")
mod <- ds.glm(formula = "LOG_LAB_TSC ~ LOG_LAB_HDL",
              data = "D",
              family = "gaussian",
              datasources = connections) 
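As a rough local illustration of why the two approaches are equivalent, the sketch below uses plain stats::glm() on made-up data. It is an analogy only: local glm() falls back to the formula's environment when a variable is missing from data, so it does not reproduce the DataSHIELD failure itself.

```r
# Local stand-in for the server-side table D (made-up values).
D <- data.frame(LAB_TSC = c(4.2, 5.1, 6.3, 5.8),
                LAB_HDL = c(1.1, 1.6, 1.4, 1.9))
LOG_LAB_TSC <- log(D$LAB_TSC)
LOG_LAB_HDL <- log(D$LAB_HDL)

# Approach 1: no data argument, the variables are found as standalone objects.
fit1 <- stats::glm(LOG_LAB_TSC ~ LOG_LAB_HDL, family = gaussian())

# Approach 2: bind the derived vectors into the data frame first,
# which is the local counterpart of the ds.dataFrame() call.
D2 <- cbind(D, LOG_LAB_TSC = LOG_LAB_TSC, LOG_LAB_HDL = LOG_LAB_HDL)
fit2 <- stats::glm(LOG_LAB_TSC ~ LOG_LAB_HDL, data = D2, family = gaussian())

# Both model variables are now columns, so complete.cases() has input to work on.
cases_ok <- stats::complete.cases(D2[, c("LOG_LAB_TSC", "LOG_LAB_HDL")])

# The two fits use the same numbers, so the coefficients agree.
same_fit <- isTRUE(all.equal(coef(fit1), coef(fit2)))
print(same_fit)
```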

arrjon commented 2 years ago

Perfect! Thank you very much, both methods worked!