Closed arrjon closed 2 years ago
Does the y variable include only positive values? If y has any negative values or zeros, then log(y) will produce NaN and -Inf values, and glm will return an error.
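As a quick local illustration in plain R (no DataSHIELD server involved; the vector `y` is made-up data), this is roughly what `log()` does with non-positive inputs:

```r
# Made-up values: one negative, one zero, two positive.
y <- c(-1, 0, 1, 10)

# log() of a negative number warns and yields NaN; log(0) yields -Inf.
log_y <- suppressWarnings(log(y))
print(log_y)

# Count the values that are not finite and would trip up a regression.
n_bad <- sum(!is.finite(log_y))
print(n_bad)
```

A similar check on real data would have to go through the DataSHIELD client functions, since the individual values are not visible on the client side.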
Yes, the variable has only positive values. Even when I just assign y to a new object with ds.assign(), say y2, the problem occurs when calling ds.glm(). Also, checking the variable with ds.isNA() (FALSE) and ds.isValid() (TRUE) gives no indication of why the regression should fail.
But I have to correct my first statement: the problem only occurs if one wants to regress log_y on log_x, so both variables have to be precomputed with ds.log(). If one variable comes directly from the datasource, there is no problem.
Here is a full working example using the "CNSIM" test data sets:
# Load the DataSHIELD client packages
library(DSI)
library(DSOpal)
library(dsBaseClient)

# Describe the three test servers
builder <- DSI::newDSLoginBuilder()
builder$append(server = "study1",
               url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM1", driver = "OpalDriver")
builder$append(server = "study2",
               url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM2", driver = "OpalDriver")
builder$append(server = "study3",
               url = "http://192.168.56.100:8080/",
               user = "administrator", password = "datashield_test&",
               table = "CNSIM.CNSIM3", driver = "OpalDriver")
logindata <- builder$build()

# Log in and assign each table to the server-side symbol D
connections <- DSI::datashield.login(logins = logindata, assign = TRUE, symbol = "D")

# Create the log-transformed variables as new server-side objects
ds.log("D$LAB_TSC", newobj = "LOG_LAB_TSC", datasources = connections)
ds.log("D$LAB_HDL", newobj = "LOG_LAB_HDL", datasources = connections)

# Regress one log-transformed variable on the other (this call fails)
mod <- ds.glm(formula = "LOG_LAB_TSC ~ LOG_LAB_HDL",
              data = "D",
              family = "gaussian",
              datasources = connections)
datashield.errors()
And the error is:
$study1
[1] "Command 'glmDS2(LOG_LAB_TSC ~ LOG_LAB_HDL, \"gaussian\", \"0,0\", \n NULL, NULL, \"D\")' failed on 'study1': Error while evaluating 'dsBase::glmDS2(LOG_LAB_TSC ~LOG_LAB_HDL, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n no input has determined the number of cases\n"
$study2
[1] "Command 'glmDS2(LOG_LAB_TSC ~ LOG_LAB_HDL, \"gaussian\", \"0,0\", \n NULL, NULL, \"D\")' failed on 'study2': Error while evaluating 'dsBase::glmDS2(LOG_LAB_TSC ~LOG_LAB_HDL, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n no input has determined the number of cases\n"
$study3
[1] "Command 'glmDS2(LOG_LAB_TSC ~ LOG_LAB_HDL, \"gaussian\", \"0,0\", \n NULL, NULL, \"D\")' failed on 'study3': Error while evaluating 'dsBase::glmDS2(LOG_LAB_TSC ~LOG_LAB_HDL, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n no input has determined the number of cases\n"
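The message text itself comes from base R's `stats::complete.cases()`. A minimal local sketch, assuming (this is not confirmed anywhere in the thread) that the server-side function ended up with nothing usable in `all.data`; passing `NULL` is simply one way to provoke an error from the same function locally:

```r
# Calling complete.cases() on NULL raises an error, because there is
# nothing from which the number of cases (rows) could be determined.
result <- tryCatch(
  stats::complete.cases(NULL),
  error = function(e) e
)

# TRUE if the call failed; the message can be read with conditionMessage().
failed <- inherits(result, "error")
print(failed)
if (failed) print(conditionMessage(result))
```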
Thanks for sharing the code.
The reason you get this error is that all the assign functions in DataSHIELD create their new objects as separate server-side objects, not as columns of an input data frame. LOG_LAB_TSC and LOG_LAB_HDL are therefore not columns of D.
Therefore, if you don't specify the name of a data frame in the ds.glm() call, it should work:
mod <- ds.glm(formula = "LOG_LAB_TSC ~ LOG_LAB_HDL",
              family = "gaussian",
              datasources = connections)
Alternatively, you can add the two new vectors as columns to data frame D (using the ds.dataFrame() function), after which the following should also work:
ds.dataFrame(x = c("D", "LOG_LAB_TSC", "LOG_LAB_HDL"), newobj = "D")
mod <- ds.glm(formula = "LOG_LAB_TSC ~ LOG_LAB_HDL",
              data = "D",
              family = "gaussian",
              datasources = connections)
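To see where the new objects actually live on the servers, one option is to compare the top-level objects with the columns of D. This is only a sketch: it assumes an open `connections` object from a login like the one in the example, `inspect_workspace` is a hypothetical helper name, and the exact shape of the returned lists may differ between dsBaseClient versions.

```r
# Hypothetical helper: list server-side objects and the columns of a data frame.
inspect_workspace <- function(connections, df_name = "D") {
  list(
    # Objects in each server-side R session; vectors created with ds.log()
    # would be expected to show up here.
    objects = dsBaseClient::ds.ls(datasources = connections),
    # Column names of the data frame; newly assigned vectors appear here
    # only after they have been added, e.g. with ds.dataFrame().
    columns = dsBaseClient::ds.colnames(x = df_name, datasources = connections)
  )
}

# Usage (requires a live DataSHIELD login):
# info <- inspect_workspace(connections)
# str(info)
```

Running it before and after the ds.dataFrame() call should make the difference between the two workarounds visible.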
Perfect! Thank you very much, both methods worked!
I have the data x and y, and I want to regress log_y on x, so I computed log_y with ds.log(). When I then call the ds.glm() function, I get the following error:
$study1
[1] "Command 'glmDS2(log_y ~ x, \"gaussian\", \"0,0\", NULL, NULL, \n \"D\")' failed on 'study1': Error while evaluating 'dsBase::glmDS2(log_y ~ x, \"gaussian\", \"0,0\", NULL, NULL, \"D\")' -> Error in stats::complete.cases(all.data) : \n no input has determined the number of cases\n"
This error does not occur if I directly regress y on x. So what is the problem here?
EDIT: The problem only occurs if one wants to regress log_y on log_x, so both variables have to be precomputed with ds.log(). If one variable comes directly from the datasource, there is no problem.