blind-contours / CVtreeMLE

:deciduous_tree: :dart: Cross Validated Decision Trees with Targeted Maximum Likelihood Estimation
MIT License
5 stars 1 forks source link

Non reproducible results #9

Closed GaryBAYLOR closed 2 years ago

GaryBAYLOR commented 2 years ago

When setting the same seed, for example, set.seed(429153), I get different results from those in README file, and for different runs with the same seed value, the results are different. Is this expected behavior of the function CVtreeMLE? Can you explain why this happens since we set the same seed value?

res1 = get_results(429153)
res2 = get_results(429153)
res3 = get_results(429153)

> res1$RMSE_results
# A tibble: 4 × 2
  `Var(s)`   RMSE
  <chr>     <dbl>
1 M1       0.0629
2 M2       0.0635
3 M3       0.0623
4 M1M2M3   0.0332
> res2$RMSE_results
# A tibble: 4 × 2
  `Var(s)`   RMSE
  <chr>     <dbl>
1 M1       0.0659
2 M2       0.0624
3 M3       0.0620
4 M1M2M3   0.0385
> res3$RMSE_results
# A tibble: 4 × 2
  `Var(s)`   RMSE
  <chr>     <dbl>
1 M1       0.0609
2 M2       0.0624
3 M3       0.0623
4 M1M2M3   0.0426

The function get_results is defined as below.

get_results = function(seed) {

  if(missing(seed)) seed = sample(100000, 1)
  print(seed)

  set.seed(seed)

  n_obs <- 500 
# split points for each mixture
splits <- c(0.99, 2.0, 2.5) 
# minimum values for each mixture
mins <- c(0, 0, 0) 
 # maximum value for each mixture
maxs <- c(3, 4, 5)
 # mu for each mixture
mu <- c(0, 0, 0)
# variance/covariance of mixture variables
sigma <- matrix(c(1, 0.5, 0.8, 0.5, 1, 0.7, 0.8, 0.7, 1), nrow = 3, ncol = 3) 
# subspace probability relationship with covariate W1
w1_betas <- c(0.0, 0.01, 0.03, 0.06, 0.1, 0.05, 0.2, 0.04) 
# subspace probability relationship with covariate W2
w2_betas <- c(0.0, 0.04, 0.01, 0.07, 0.15, 0.1, 0.1, 0.04) 
 # probability of mixture subspace (for multinomial outcome generation)
mix_subspace_betas <- c(0.00, 0.08, 0.05, 0.01, 0.05, 0.033, 0.07, 0.09)
# mixture subspace impact on outcome Y, here the subspace where M1 is lower and 
# M2 and M3 are higher based on values in splits
subspace_assoc_strength_betas <- c(0, 0, 0, 0, 0, 0, 6, 0) 
# marginal impact of mixture component on Y
marginal_impact_betas <- c(0, 0, 0) 
# random error
eps_sd <- 0.01 
# if outcome is binary
binary <- FALSE

sim_data <- simulate_mixture_cube(
  n_obs = n_obs, 
  splits = splits,
  mins = mins,
  maxs = maxs,
  mu = mu,
  sigma = sigma,
  w1_betas = w1_betas,
  w2_betas = w2_betas,
  mix_subspace_betas = mix_subspace_betas,
  subspace_assoc_strength_betas = subspace_assoc_strength_betas,
  marginal_impact_betas = marginal_impact_betas,
  eps_sd = eps_sd,
  binary = binary
)

lrnr_glm <- Lrnr_glm$new()
lrnr_bayesglm <- Lrnr_bayesglm$new()
lrnr_gam <- Lrnr_gam$new()
lrnr_lasso <- Lrnr_glmnet$new(alpha = 1)
lrnr_earth <- Lrnr_earth$new()
lrnr_ranger <- Lrnr_ranger$new()
# put all the learners together (this is just one way to do it)
learners <- c(lrnr_glm, lrnr_bayesglm, lrnr_gam, lrnr_ranger)

Q1_stack <- make_learner(Stack, learners)

lrnr_glmtree_001 <- Lrnr_glmtree$new(alpha = 0.5, maxdepth = 3)
lrnr_glmtree_002 <- Lrnr_glmtree$new(alpha = 0.6,  maxdepth = 4)
lrnr_glmtree_003 <- Lrnr_glmtree$new(alpha = 0.7, maxdepth = 2)
lrnr_glmtree_004 <- Lrnr_glmtree$new(alpha = 0.8, maxdepth = 1)

learners <- c( lrnr_glmtree_001, lrnr_glmtree_002, lrnr_glmtree_003, lrnr_glmtree_004)
discrete_sl_metalrn <- Lrnr_cv_selector$new()

tree_stack <- make_learner(Stack, learners)

discrete_tree_sl <- Lrnr_sl$new(
  learners = tree_stack,
  metalearner = discrete_sl_metalrn
)

ptm1 <- proc.time()

sim_results <- CVtreeMLE(data = sim_data,
                         W = c("W", "W2"),
                         Y = "y",
                         A = c(paste("M", seq(3), sep = "")),
                         back_iter_SL = Q1_stack,
                         tree_SL = discrete_tree_sl, 
                         n_folds = 2,
                         family = "gaussian")

ptm2 <- proc.time()

RMSE_results <- sim_results$`Model RMSEs`
mixture_results <- sim_results$`Pooled TMLE Mixture Results`
mixture_v_results <- sim_results$`V-Specific Mix Results`

return(list(seed=seed, RMSE_results= RMSE_results, mixture_results= mixture_results, mixture_v_results= mixture_v_results, time=ptm2-ptm1))
}
blind-contours commented 2 years ago

This has been fixed. Issue was that seeds created in the parallelized section would override the seed created. I've switched everything to furrr::future_map now and made sure the seed is set in each function. Now - users pass a seed number and this is used in each function. The above test now gives the same results for each seed.