There are at least two ways to accomplish this using Cyclops:
Great, I'm taking the first approach. Thanks for the help!
Apparently I'm not totally understanding what the first approach entails. I'm trying to do cross-validation with the weights, but I get the following error:
Error in fitCyclopsModel(cyclopsData, prior = prior, control = control, :
Can not set data weights and use cross-validation simultaneously
I'm obviously doing something basic wrong. What is the right way to do this? I'm pasting in almost all the code needed to reproduce the error. The initial cohortData object is the one returned by dbGetCohortData in CohortMethod. Most of the code isn't necessary, but it provides context; only the last couple of lines matter from the Cyclops perspective.
names(cohortData)
# [1] "outcomes" "cohorts" "covariates" "exclude" "covariateRef"
# [6] "metaData"
names(cohortData$cohorts)
# [1] "rowId" "treatment" "personId"
# [4] "timeToObsPeriodEnd" "timeToCohortEnd"
names(cohortData$covariates)
# [1] "rowId" "covariateId" "covariateValue"
cohortSubset <- cohortData$cohorts
covariateSubset <- ffbase::subset.ffdf(cohortData$covariates, covariateId != 1)
colnames(cohortSubset)[colnames(cohortSubset) == "treatment"] <- "y"
cyclopsData <- convertToCyclopsData(
    cohortSubset,
    covariateSubset,
    modelType = "lr",
    quiet = TRUE,
    checkSorting = TRUE)
prior = createPrior("laplace", exclude = c(0), useCrossValidation = TRUE)
control = createControl(noiseLevel = "silent", cvType = "auto", startingVariance = 0.1)
num_rows <- nrow(cohortSubset)
weights <- sample(0:1, num_rows, replace = TRUE, prob = c(0.2, 0.8))
# This fails!
cyclopsFit <- fitCyclopsModel(
    cyclopsData,
    prior = prior,
    control = control,
    weights = weights)
# Error in fitCyclopsModel(cyclopsData, prior = prior, control = control, :
# Can not set data weights and use cross-validation simultaneously
Thanks for all the help!
Ah, yes, that's a use-case I had not anticipated: there are weights, and then cross-validation within the weighted data. Right now, cross-validation inside Cyclops overwrites any weights specified in R -- hence the stop("nasty message"). However, this use-case seems important, and it is certainly possible after a couple of small changes to CrossValidationSelector.cpp to respect the existing weights, plus a unit-test.
@ApproximateIdentity, can you mock up a simple, small example (like a 1000 x 10 logistic regression), splitting into test- and train-sets, fitting the train-set separately with cross-validation in Cyclops and then predicting on the test-set? I'll use that as a gold standard in a unit-test.
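For concreteness, the requested gold standard might look something like the following sketch; it reuses generate_data, prior, and control from the unittest.R script below, and it physically subsets the long-format data (which is exactly how zero weights should behave):

data <- generate_data()
num_rows <- length(unique(data$X$rowId))
weights <- sample(0:1, num_rows, replace = TRUE, prob = c(0.2, 0.8))
# Physically drop the held-out rows instead of zero-weighting them.
trainIds <- data$y$rowId[weights == 1]
trainData <- convertToCyclopsData(data$y[data$y$rowId %in% trainIds, ],
                                  data$X[data$X$rowId %in% trainIds, ],
                                  modelType = "lr", quiet = TRUE)
# Cross-validation then runs entirely inside the training set.
trainFit <- fitCyclopsModel(trainData, prior = prior, control = control)
# Predicting on the rows with weights == 0 would complete the gold standard.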
Before I continue with the test/train parts, is this the sort of thing you're looking for? Right now this will fail if you run it as Rscript unittest.R:
unittest.R
library(Cyclops)

main <- function() {
    data <- generate_data()
    cyclopsData <- convertToCyclopsData(
        data$y,
        data$X,
        modelType = "lr",
        quiet = TRUE,
        checkSorting = TRUE)
    prior = createPrior("laplace", exclude = c(0), useCrossValidation = TRUE)
    control = createControl(noiseLevel = "silent", cvType = "auto",
                            startingVariance = 0.1)
    num_rows <- length(unique(data$X$rowId))
    weights <- sample(0:1, num_rows, replace = TRUE, prob = c(0.2, 0.8))
    # This fails!
    cyclopsFit <- fitCyclopsModel(
        cyclopsData,
        prior = prior,
        control = control,
        weights = weights)
}
generate_data <- function(num_persons = 1000,
                          num_cov = 2000,
                          max_num_cov_per_person = 10) {
    num_cov_per_person <- sample(1:max_num_cov_per_person,
                                 num_persons, replace = TRUE)
    cov_per_person <- lapply(num_cov_per_person,
                             function(x) sample(1:num_cov, x))
    # Generate X matrix.
    X <- data.frame()
    for (person in 1:num_persons) {
        covs <- cov_per_person[[person]]
        sub_X = data.frame(
            rowId = rep(person, length(covs)),
            covariateId = covs,
            covariateValue = rep(1, length(covs)))
        X <- rbind(X, sub_X)
    }
    # Sort X matrix.
    X <- X[with(X, order(rowId, covariateId)), ]
    # Generate y vector.
    y <- data.frame(
        rowId = 1:num_persons,
        y = sample(0:1, num_persons, replace = TRUE))
    data <- list(X = X, y = y)
    return(data)
}
main()
I decided to try to implement this myself, but I know I'm not doing it right. This is partially due to my lack of knowledge of the code-base, and probably even more due to my not really knowing C++. I was able to add in the pieces that seem philosophically necessary: I changed CrossValidationSelector::getWeights to take an extra parameter (called base_weights), which it then updates with the cross-validation weights.
In order to do this I needed to make many changes to other files so that things would continue to compile. Most of those changes are probably totally wrong (not even "philosophically correct"), since they were basically hacked in just so everything would compile. The only thing that I hope is mostly correct is the code necessary to run the unittest.R script from my previous post.
I pushed the changes I made to the following branch:
https://github.com/OHDSI/Cyclops/tree/cross_val_weights
It might be helpful for you to look at it. I personally find it easier to look at others' changes (even if they're bad) when thinking through how I'd do it. Feel free to ignore it, change it, or tell me to change things. Whatever's convenient for you.
Thanks for everything!
Looking good @ApproximateIdentity. I'd make the following API change: keep the get/set naming consistent between the float and double versions.
Question about internals: I am not sure what you've done will work when, say, the first X% of rows are weighted out. Basically, your weights need to influence intervalStart, otherwise you'll end up with some batch sizes of 0 and not true X-fold partitioning. See, e.g., in CrossValidationSelector.cpp:
std::copy(
    permutation.begin() + intervalStart[batch],
    permutation.begin() + intervalStart[batch + 1],
    insert_iterator< std::set<int> >(excludeSet, excludeSet.begin()));
You'll probably want to send the existing CCD.getWeights() to the AbstractSelector constructor (as is done, for example, with weightsExclude) so that intervalStart gets properly constructed.
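The partitioning principle, sketched in R for clarity (the real logic lives in CrossValidationSelector.cpp): fold boundaries must be laid out over only the weighted-in rows, or some folds can come up empty.

weights <- sample(0:1, 100, replace = TRUE, prob = c(0.2, 0.8))
fold <- 10
included <- which(weights == 1)  # rows that survive the base weights
permutation <- sample(included)  # permute only the eligible rows
# Fold boundaries over the weighted-in rows, not over all 100 rows.
intervalStart <- round(seq(0, length(included), length.out = fold + 1))
# Fold k holds out permutation[(intervalStart[k] + 1):intervalStart[k + 1]]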
Some things to check before you merge: do the weights work byStratum and not just byRow (the conditioned models)?
best, M
Thanks for all the info! I'll clean up the get/set stuff. I noticed the inconsistency and just picked one of them at random. It's easy to put that back.
As to the internals, what you're saying is definitely part of what isn't working right for me. I've done internal logging to tell me how many weights make it through, and the fact that I'm not handling those intervals correctly shows up pretty explicitly. I see what you're saying about sending the weights to the AbstractSelector, and that does seem intuitively right. I'll try to throw that in.
As to the testing, the tests certainly can't be passing, since I haven't run them and I know of problems independent of that. Initially I was just trying to actually grok what was going on and make superficial changes that weren't totally ridiculous. So what makes the most sense is for me to make sure the tests pass and to add more for the newer stuff. That way I'll have something to code against.
I'll keep you updated. Hopefully I can increase the quality from just "compilable" to "vaguely rational" soon. Thanks for all the help!
Quick question. Are these test failures normal? This is being run off the master branch before any of my changes.
Sorting outcomes by stratumId and rowId
Sorting covariates by stratumId and rowId
Sorting outcomes by stratumId and rowId
Sorting covariates by stratumId and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sparseness = 75.815 %
1. Failure (@test-dataConversionStratified.R#156): Test conditional poisson regression
as.vector(sort(coef(fit))) not equal to as.vector(sort(coef(fitFormula)))
2/2 mismatches (average diff: 3.9e-07).
First 2:
pos x y diff
1 -0.00255 -0.00255 3.3e-07
2 0.00208 0.00208 4.5e-07
2. Failure (@test-dataConversionStratified.R#157): Test conditional poisson regression
as.vector(sort(coef(fit))) not equal to as.vector(sort(coef(gold)))
2/2 mismatches (average diff: 3.13e-07).
First 2:
pos x y diff
1 -0.00255 -0.00255 -9.73e-08
2 0.00208 0.00208 5.28e-07
3. Failure (@test-dataConversionStratified.R#158): Test conditional poisson regression
as.vector(sort(coef(fitFfdf))) not equal to as.vector(sort(coef(gold)))
2/2 mismatches (average diff: 3.13e-07).
First 2:
pos x y diff
1 -0.00255 -0.00255 -9.73e-08
2 0.00208 0.00208 5.28e-07
Sorting covariates by rowId
Sorting covariates by rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sparseness = 75.41 %
testthat results ================================================================
OK: 126 SKIPPED: 5 FAILED: 3
1. Failure (@test-dataConversionStratified.R#156): Test conditional poisson regression
2. Failure (@test-dataConversionStratified.R#157): Test conditional poisson regression
3. Failure (@test-dataConversionStratified.R#158): Test conditional poisson regression
Error: testthat unit tests failed
In addition: Warning message:
In fun(libname, pkgname) : no DISPLAY variable so Tk is not available
Execution halted
master should fail no unit tests; it certainly passes on Travis-CI. The issues above look like ff problems. Try rerunning the tests after restarting R, via devtools::test(".").
I got side-tracked from the tests, but I'll get that sorted out. On another note, how do I use your built-in stream logging? I see many streams that would be useful to me, and I should probably start using them instead of reinventing the wheel. Is it something I turn on at compile time?
There's nothing to turn on.

std::ostringstream stream;
stream << "Random message";
error->throwError(stream);

will throw an R stop (or exit(-1) if run from the command-line), and

std::ostringstream stream;
stream << "Random message";
logger->writeLine(stream);

will write to the R output (or STDOUT if run from the command-line).
Okay, the tests are fine (i.e., on master... not my branch, which still fails as expected). The errors are somehow coming from the way I'm custom-building the software (so I can use a -O0 optimization flag, which makes debugging easier... apparently my way of building introduces bugs, though). Also, thanks for the stream info. I didn't realize that's what the parameter in the control was for.
On another note, I'm trying to figure out the best way to set up the permutation. In CrossValidationSelector's constructor I see that the intervalStart array is built. Since these weights are fixed for all of the CrossValidationSelectors in the pool, would it make sense to pass the base_weights into that constructor and use them to set up intervalStart? That seems like the most natural approach to me at the moment. Does this seem like a bad idea to you? Am I making sense?
Thanks for all the help!
At construction seems like a reasonable place.
I have implemented a version of this on the following branch:
https://github.com/OHDSI/Cyclops/tree/cross_val_weightsv2
This seems to make much more logical sense. One thing I changed, though, was in R/ModelFit.R, and it affects some other functions. I changed it so that if you pass in weights = NULL from R, all weights are set to 1 before calling into the C++ code. I.e., see the changes made to ModelFit.R in this commit:
https://github.com/OHDSI/Cyclops/commit/53c386162beb606676579df47f4d89b4ca0c8152
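In essence, the change amounts to the following sketch (not the exact diff from that commit; getNumberOfRows is the existing Cyclops accessor):

# Inside fitCyclopsModel() in R/ModelFit.R: NULL weights become explicit
# all-ones weights before dropping into the C++ code.
if (is.null(weights)) {
    weights <- rep(1, getNumberOfRows(cyclopsData))
}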
This seems to streamline things from my perspective, but it does cause two errors to occur in other functionality. I tracked them down (I think) to the following hard-coded exception:
https://github.com/OHDSI/Cyclops/blob/master/src/cyclops/engine/ModelSpecifics.hpp#L882
I guess the issue is that when the weights are set, the boolean useWeights gets set as well, which I believe is what is causing this problem.
So after all that, my main question is: is there any sense in changing things so that weights are always set, and set to all 1s whenever they previously were not? This seems to simplify some aspects from my perspective, but I could well imagine there are optimizations in the no-weights case that would be lost if weights were always set (even to 1s).
I presume I'll just need to go back and change my code to not set 1s for weights, but I figured I would ask before doing that. Also here is the output of devtools::test('.') in case you're interested:
> devtools::test('.')
Loading required package: testthat
Testing Cyclops
Loading Cyclops
Loading required package: splines
.......Waiting for profiling to be done...
[1] 0.6364936 1.0993638 1.5622341 2.0251044 2.4879746 2.9508449 3.4137152
[8] 3.8765855 4.3394557 4.8023260
[1] -2.5940592 -1.9715805 -1.3225559 -0.6600137 0.0000000 0.6418664
[7] 1.2536753 1.8287277 2.3650617 2.8639503
[1] -5.87399440 -5.14342461 -4.41285483 -3.68228504 -2.95171526 -2.22114547
[7] -1.49057568 -0.76000590 -0.02943611 0.70113367 1.43170346
[1] -2.562444 -2.266808 -1.935551 -1.557924 -1.119973 -0.605426 0.000000
[8] 0.698676 1.468945 2.265106 3.037878
.............SSLoading required package: bit
Attaching package bit
package:bit (c) 2008-2012 Jens Oehlschlaegel (GPL-2)
creators: bit bitwhich
coercion: as.logical as.integer as.bit as.bitwhich which
operator: ! & | xor != ==
querying: print length any all min max range sum summary
bit access: length<- [ [<- [[ [[<-
for more help type ?bit
Attaching package: ‘bit’
The following object is masked from ‘package:base’:
xor
Attaching package ff
- getOption("fftempdir")=="/tmp/RtmpfFY0zK"
- getOption("ffextension")=="ff"
- getOption("ffdrop")==TRUE
- getOption("fffinonexit")==TRUE
- getOption("ffpagesize")==65536
- getOption("ffcaching")=="mmnoflush" -- consider "ffeachflush" if your system stalls on large writes
- getOption("ffbatchbytes")==16777216 -- consider a different value for tuning your system
- getOption("ffmaxbytes")==536870912 -- consider a different value for tuning your system
Attaching package: ‘ff’
The following objects are masked from ‘package:bit’:
clone, clone.default, clone.list
The following objects are masked from ‘package:utils’:
write.csv, write.csv2
The following objects are masked from ‘package:base’:
is.factor, is.ordered
Sorting outcomes by stratumId and rowId
Sorting covariates by stratumId and rowId
Sorting outcomes by stratumId and rowId
Sorting covariates by stratumId and rowId
..Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
...Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
...Sparseness = 75.885 %
...Sorting covariates by rowId
Sorting covariates by rowId
..Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
Sorting outcomes by stratumId, time (descending), y, and rowId
Sorting covariates by stratumId, time (descending), y, and rowId
...Sparseness = 76.565 %
.............................S....................1...........SS...............2..............
1. Error: Small conditional logistic regression --------------------------------
c++ exception (unknown reason)
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"),
warning = function(c) invokeRestart("muffleWarning"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: expect_equal(vcov(cyclopsFit), vcov(gold), tolerance = tolerance) at test-smallCLR.R:23
5: expect_that(object, equals(expected, label = expected.label, ...), info = info, label = label)
6: condition(object)
7: compare(expected, actual, ...)
8: compare.numeric(expected, actual, ...)
9: all.equal(x, y, ...)
10: all.equal.numeric(x, y, ...)
11: attr.all.equal(target, current, tolerance = tolerance, scale = scale, ...)
12: mode(current)
13: vcov(cyclopsFit)
14: vcov.cyclopsFit(cyclopsFit)
15: .cyclopsGetFisherInformation(object$cyclopsInterfacePtr, NULL) at /home/tnyberg/github/ohdsi/Cyclops/R/ModelFit.R:676
16: stop(structure("c++ exception (unknown reason)", class = "try-error", condition = structure(list(
message = "c++ exception (unknown reason)", call = NULL), .Names = c("message",
"call"), class = c("simpleError", "error", "condition"))))
2. Error: Get SEs in small Poisson model ---------------------------------------
c++ exception (unknown reason)
1: withCallingHandlers(eval(code, new_test_environment), error = capture_calls, message = function(c) invokeRestart("muffleMessage"),
warning = function(c) invokeRestart("muffleWarning"))
2: eval(code, new_test_environment)
3: eval(expr, envir, enclos)
4: getSEs(cyclopsFit, c(1:5)) at test-smallPoisson.R:109
5: .cyclopsGetFisherInformation(object$cyclopsInterfacePtr, covariates) at /home/tnyberg/github/ohdsi/Cyclops/R/ModelFit.R:566
6: stop(structure("c++ exception (unknown reason)", class = "try-error", condition = structure(list(
message = "c++ exception (unknown reason)", call = NULL), .Names = c("message",
"call"), class = c("simpleError", "error", "condition"))))
Warning message:
In fun(libname, pkgname) : no DISPLAY variable so Tk is not available
Please only use weights when necessary. There are a number of optimizations that rely on useWeights == false. Why read a memory location when one knows it will return 1?
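For instance, the two calls below should agree, but the second forces the weighted code path for no benefit (a sketch; it assumes getNumberOfRows and the forceNewObject argument behave as in the current Cyclops API):

fitNoWeights <- fitCyclopsModel(cyclopsData, prior = createPrior("none"),
                                forceNewObject = TRUE)
fitAllOnes <- fitCyclopsModel(cyclopsData, prior = createPrior("none"),
                              forceNewObject = TRUE,
                              weights = rep(1, getNumberOfRows(cyclopsData)))
all.equal(coef(fitNoWeights), coef(fitAllOnes))  # same answer, slower path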
I changed it back so that setWeights() is no longer called when NULL weights are passed. I need to do some internal testing, but I think this is shaping up to be okay enough to be worth moving this discussion into a pull request.
I have a final question before the "tear apart the code in the pull request" phase. I need to make changes to CrossValidationSelector::permute() to take into account the fact that many weights are now 0:
https://github.com/OHDSI/Cyclops/blob/master/src/cyclops/drivers/CrossValidationSelector.cpp#L118
My main question is the following: what exactly is the purpose of weightsExclude/wtsExclude? I searched through the codebase, and it seems the only place it is ever passed in is here:
https://github.com/OHDSI/Cyclops/blob/master/src/cyclops/imputation/ImputeVariables.cpp#L156
and weightsMissing gets set by these functions:
https://github.com/OHDSI/Cyclops/blob/master/src/cyclops/imputation/ImputeVariables.cpp#L149 https://github.com/OHDSI/Cyclops/blob/master/src/cyclops/imputation/ImputeVariables.cpp#L152
I can follow the types to roughly understand what's going on from a programmatic perspective, and I can learn a lot from function and file names, but I was wondering about the down-to-earth explanation of what these weights actually are. I feel a little uncomfortable making some of these changes without that basic understanding.
Hopefully this isn't too big a bother. Thanks!
I just did more searching through the source tree and realized that all the code in the imputation/ folder seems to be unused. At least, if I delete that folder, everything compiles and links fine and all tests still pass. Is this old code that's no longer used?
The impute folder was added by Sushil a couple of years ago. I do not know (1) whether the code in that directory is complete, (2) whether it works, or (3) what it is supposed to do.
BTW, before you merge, your branch is failing Travis-CI.
Is it okay if I just remove the folder entirely, then? It'll still be in the git history, so it's not as if it's lost. And it seems silly/confusing/error-prone to have dead code floating around.
And thanks for the heads up about Travis! I'll have to figure out what's going on there.
Hi Marc, I've been trying to figure out what's going on with the Travis build problem, and I'm getting a bit confused. Travis throws errors, but neither of my other two build environments (the IMEDS instance or my local machine) throws them. The only real difference I can find is that the Travis build system uses Ubuntu 12.04 LTS, which ships an older g++ compiler than the newer systems.
Regardless, the errors that Travis sees on my branch show up on the master branch as well. So far I don't think I've introduced them...though I could certainly be wrong.
One of the errors can be localized with the following script:
unittest.R
library("Cyclops")
library("testthat")
library("survival")
library("gnm")
tolerance <- 1E-6
gold.clogit <- clogit(event ~ exgr + agegr + strata(indiv) + offset(loginterval),
                      data = Cyclops::oxford)
dataPtr <- createCyclopsData(event ~ exgr + agegr + strata(indiv) + offset(loginterval),
                             data = Cyclops::oxford,
                             modelType = "clr")
cyclopsFit <- fitCyclopsModel(dataPtr,
                              prior = createPrior("none"))
print("* print(logLik(cyclopsFit))")
print(logLik(cyclopsFit))
print("* print(logLik(gold.clogit))")
print(logLik(gold.clogit))
print("* print(attributes(logLik(cyclopsFit))):")
print(attributes(logLik(cyclopsFit)))
print("* print(attributes(logLik(gold.clogit))):")
print(attributes(logLik(gold.clogit)))
expect_equal(logLik(cyclopsFit), logLik(gold.clogit))
I have installed a system very close to the Travis system in a VirtualBox. Running unittest.R gives the following errors on the respective branches:
master
Loading required package: methods
[1] "* print(logLik(cyclopsFit))"
'log Lik.' -10.08828 (df=2)
[1] "* print(logLik(gold.clogit))"
'log Lik.' -10.08828 (df=2)
[1] "* print(attributes(logLik(cyclopsFit))):"
$df
[1] 2
$nobs
[1] 38
$class
[1] "logLik"
[1] "* print(attributes(logLik(gold.clogit))):"
$df
[1] 2
$class
[1] "logLik"
Error: logLik(cyclopsFit) not equal to logLik(gold.clogit)
Attributes: < Length mismatch: comparison on first 2 components >
Execution halted
cross_val_weightsv2
Loading required package: methods
[1] "* print(logLik(cyclopsFit))"
'log Lik.' -10.08828 (df=2)
[1] "* print(logLik(gold.clogit))"
'log Lik.' -10.08828 (df=2)
[1] "* print(attributes(logLik(cyclopsFit))):"
$df
[1] 2
$nobs
[1] 38
$class
[1] "logLik"
[1] "* print(attributes(logLik(gold.clogit))):"
$df
[1] 2
$class
[1] "logLik"
Error: logLik(cyclopsFit) not equal to logLik(gold.clogit)
Attributes: < Length mismatch: comparison on first 2 components >
Execution halted
The issue seems to be the "nobs" attribute, which does not show up in the gold.clogit case. So maybe this isn't technically an issue with Cyclops?
Have you seen this kind of error before? I'll keep digging and see if I find something...
Okay, after a lot of detective work, I found the problem causing the build errors on Travis. All of my earlier ideas about the problem were wrong. It's simply that the survival package apparently changed its interface (annoying...).
I.e., if you install survival version 2.37-7 instead of the current version 2.38-1 (which is what is available on CRAN), the errors go away. I'm going to email their development list, but since they may want to keep it this way (or it may take a while to revert), we may want to do something else in the meantime. (Like ignore these errors?)
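If we go the ignore-them route, one interim fix is to compare only the numeric values (a sketch, using the names from the unittest.R script above):

# Strip the logLik attributes (df, nobs) that the survival update changed
# and compare only the values themselves.
expect_equal(as.numeric(logLik(cyclopsFit)), as.numeric(logLik(gold.clogit)),
             tolerance = tolerance)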
For convenience, here is the page with the different versions of survival (see "Old sources:"):
API changes in survival are issues that, I believe, have been addressed in the master unit-tests.
The master branch definitely does not pass the tests when using survival 2.38-1, either on my development box or in my virtual machine meant to mirror the Travis build. On my boxes, my branch fails these same tests as well.
There are branches that are passing Travis-CI. I have been working in new_interface for a couple of weeks.
Oh I now see this commit in which you fixed this:
https://github.com/OHDSI/Cyclops/commit/18bd6a71497d160cbb59b24a8e216b57a6317791
This sort of thing seems like a bugfix (not really in Cyclops...), so doesn't it make sense to just merge this one change directly into master? That way master won't have errors, and those branching from master won't see them. I could just go ahead and do that myself.
In any case, now that I know these issues are independent of my changes, I'll try to finish up the changes I've been making.
Dead topic
I'm trying to do some analysis with a train/test setup, but I'm having trouble seeing how to do it with Cyclops. I've been using the predict function (https://github.com/OHDSI/Cyclops/blob/master/R/ModelFit.R#L474), but it seems (to me) to be pretty hard-wired to taking an already-fitted object. I'd like to pass it a fit object and the test set (or do something equivalent), but I can't see how to do that.
Is there a way to do this within Cyclops? I wanted to make sure I wasn't missing anything before I did my own home-cooked stuff. Thanks a lot!
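For concreteness, the home-cooked version I have in mind would look something like the sketch below; manualPredictLr is a hypothetical helper, and it assumes coef() names match the covariateIds in the long-format (rowId, covariateId, covariateValue) data used earlier in this thread:

manualPredictLr <- function(cyclopsFit, covariates) {
    beta <- coef(cyclopsFit)
    # Match each covariate row to its fitted coefficient by name.
    b <- beta[as.character(covariates$covariateId)]
    b[is.na(b)] <- 0  # covariates absent from the fit contribute nothing
    # Linear predictor: per-row sum of covariateValue * coefficient.
    lp <- tapply(covariates$covariateValue * b, covariates$rowId, sum)
    # Add the intercept if the model has one (covariateId 0 in this thread).
    if ("0" %in% names(beta)) lp <- lp + beta["0"]
    plogis(lp)  # predicted probabilities for a logistic ("lr") model
}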