conroylau / lpinfer

lpinfer: An R Package for Inference in Linear Programs
GNU General Public License v3.0
3 stars 5 forks source link

Multiple tuning parameter in `invertci` #89

Closed conroylau closed 3 years ago

conroylau commented 4 years ago

In several procedures, we allow the user to pass multiple tuning parameters. For instance, users can pass multiple tau parameters in the DKQS procedure. This could be problematic for the invertci function if we allow them to do so. The reason is that we are finding the p-value for different beta.tgt in order to construct the confidence interval. If we allow for multiple tuning parameters, then each parameter could give a different p-value for the same beta.tgt. Then this means that the decision to reject or not to reject at a certain beta.tgt could be different for different tuning parameters.   I cannot think of a perfect solution to this problem yet, but I have an idea as follows:


As a side note, I think the procedure of evaluating all the bootstrap replications of a stochastic component in lpmodel that is a function (as mentioned in issue #88) can be done only once even if the user passes multiple tuning parameters. This is because the LP of finding the logical bounds is independent on the tuning parameter.

a-torgovitsky commented 4 years ago

Is there anything currently in invertci that mentions a specific testing procedure? It would be nice if we could keep it procedure-invariant. Of course this requires each procedure having predictable and harmonized output.

conroylau commented 4 years ago

Currently, only chorussell is mentioned in invertci. The reason of including it is to prevent the code from doing the bisection method to construct the confidence interval.

a-torgovitsky commented 4 years ago

I see, so that must mean that invertci expects a testing procedure to return an output that is standardized across procedures. Is that correct? What does this output look like?

conroylau commented 4 years ago

Yes that's correct. invertci is extracting the p-values via the pval object in the return list of the testing procedures. It is a data frame where the first column contains the tuning parameter and the second column contains the p-value. The number of rows of the data frame refers to the number of tuning parameters used.

For example:

rm(list = ls())
devtools::load_all()
source("../lpinfer/example/dgp_missingdata.R")

set.seed(1)
data <- missingdata_draw(J = 5, n = 500, seed = 1)
lpm <- missingdata_lpm(J = 5, info = "mean", data = data)
r <- dkqs(data, lpm, beta.tgt = .5, tau = c(.1, .5))
print(r$pval)

gives the following table of p-values:

  tau p-value
1 0.1    0.65
2 0.5    0.69

Currently, I am extracting the [1,2]-entry of the pval object from the return list for the p-value when there is only one tuning parameter.

a-torgovitsky commented 4 years ago

Two ideas, one easy and one harder:

Easy

Only compute a confidence interval for the first row. Warn the user that this is what is being done and that it would be computationally faster to pass one tuning parameter at a time.

Harder

  1. First compute the confidence interval for the first row, tau = .1.
  2. Along the way you have accumulated some information about which points would be inside or outside the confidence interval for tau = .5. Use that information to set a good bracket for tau = .5
  3. Compute compute interval for tau = .5 starting with the good bracket determined in step 2.
  4. Repeat for all rows.

The "harder" solution is still not a great I guess because there is some work being repeated. i.e. when computing the confidence interval for tau = .5 we are testing tau = .1 points for which we already know the answer.

What do you think?

conroylau commented 4 years ago

I see. I am thinking of extending the harder solution as follows:

  1. Based on the first iteration, we can get a data frame pval that is returned from the testing procedure. This consists of the name of the tuning parameter as well as the p-values for each value of the tuning parameter.
  2. Continue to construct the confidence interval for the first row tau = .1 by modifying the farg argument (say by farg$tau <- .1). This should be possible because we know the name of the tuning variable from the pval data frame. In addition, this step can be skipped if pval only has one row because there is only one tuning parameter.
  3. After the confidence interval for tau = .1 is constructed, go to the next parameter tau = .5.
  4. Repeat the procedure until all parameters have been considered.

In this way, I think we will not be repeating work because except for the initial step, we are only computing the p-values for a single tuning parameter. Do you think this is feasible? Thanks!

a-torgovitsky commented 4 years ago

Sounds feasible to me. But what if we have a procedure with multiple tuning parameters? Should we decide that all columns of pval besides the last correspond to tuning parameters?


These types of issues interact with the long-term goal of the module a bit. It would be nice in the future to be able to add procedures without having to think about how they are going to interact with the rest of the structure (lpmodel, invertci, etc.). So we should have some clear input/output rules that any new procedure has to follow. And we need to make these sufficiently general so that we can handle unknown procedures we come across in the future.

conroylau commented 4 years ago

Sure, I think it is a good idea to have a rule where all columns of pval besides the last column are tuning parameters. The column names except for the last one should be the name of the tuning parameters. I will update this in the part that computes the p-value in invertci.


I will also check the other parts of invertci to see if they are general enough for us to add new procedures in the future.

Thanks!

a-torgovitsky commented 4 years ago

Ok sounds like a plan. Thanks!

conroylau commented 4 years ago

I have just updated the invertci procedure where it can now construct confidence intervals when there are multiple parameters.

To illustrate that it works for procedures with multiple tuning parameters, I have created the toy procedure below. In this procedure, there are two tuning parameters beta and gamma. They can both accept multiple values from the user. For simplicity, I am skipping the data and lpmodel part of the procedure.

eg_procedure <- function(beta.tgt, beta, gamma) {
  # Data frame that contains the p-values
  df <- data.frame(matrix(vector(),
                          nrow = length(beta) * length(gamma),
                          ncol = 3))
  colnames(df) <- c("beta", "gamma", "p-value")

  # Assign the p-values
  k <- 1
  for (i in seq_along(beta)) {
    for (j in seq_along(gamma)) {
      df[k, 1] <- beta[i]
      df[k, 2] <- gamma[j]
      df[k, 3] <- max(abs(beta.tgt - beta * 5 + gamma * 2), 1)
      k <- k + 1
    }
  }
  return(list(pval = df,
              logical.lb = 0,
              logical.ub = 1))
}

As discussed above, it is returning a data frame pval that has three columns:

Then, I can use the above procedure with invertci to construct confidence intervals. As with the original design of invertci, users can pass multiple alpha to the procedure. The following code

library(lpinfer)
set.seed(1)
r <- invertci(f = eg_procedure,
              farg = list(beta = c(.1, .3),
                           gamma = c(.1, .5)),
              lb0 = 0,
              lb1 = .5,
              ub1 = .6,
              ub0 = 1,
              max.iter = 5,
              alpha = c(.9, .95),
              progress = FALSE)
print(r)

gives:

Confidence intervals:
 Significance level beta gamma  Confidence interval
               0.90  0.1   0.1 [0.4921875, 0.60625]
               0.90  0.1   0.5 [0.0078125, 0.99375]
               0.90  0.3   0.1 [0.4921875, 0.60625]
               0.90  0.3   0.5 [0.0078125, 0.99375]
               0.95  0.1   0.1 [0.4921875, 0.60625]
               0.95  0.1   0.5 [0.0078125, 0.99375]
               0.95  0.3   0.1 [0.4921875, 0.60625]
               0.95  0.3   0.5 [0.0078125, 0.99375]

During the iterations, the code will also show the parameters that are being used, such as the following excerpt from the output:

…
********** Constructing confidence interval for significance level = 0.95 **********                         
Parameters:    beta gamma
                0.3   0.5

 === Computing upper bound of confidence interval ===
 Iteration   Lower bound     Upper bound     Test point      p-value     Reject?
 Left end pt.    0.60000     NA      0.60000     0.50000     FALSE 
 Right end pt.   NA      1.00000     1.00000     0.50000     FALSE 
 1       0.60000     1.00000     0.80000     0.50000     FALSE 
…
a-torgovitsky commented 4 years ago

This is very good, nice job! Can you record your example in the examples folder? This may be useful in the future as an example of how to code in additional procedures.

conroylau commented 4 years ago

Thank you!

Sure, I have included the extra example under the example folder.

conroylau commented 4 years ago

I am updating an optional input from the user df_ci that is related to this issue. This table contains the p-values that correspond to different beta.tgt that were pre-evaluated by the user. This object is also returned from the invertci procedure, where it includes the p-values computed inside the procedure.

I have added a requirement that it needs to contain the value of the tuning parameters which correspond to each p-values and beta.tgt. This is because invertci can now take care of the procedures with multiple tuning parameters. Now the first n columns of the table corresponds to the n tuning parameters of the procedure. The last two rows are the beta.tgt and p-value.

Do you think I should rename the df_ci object as pvals? I am thinking if it will give a more informative name because it actually contains the p-values at different beta.tgt Since we have added the table ci that stores the confidence intervals in #96, I think using the name df_ci might be misleading too.

Thanks!

a-torgovitsky commented 4 years ago

Good point. I think pvals is a good name. I also think we want to make the name of the returned object the same as the name of the passed object to avoid confusion.

conroylau commented 4 years ago

Sure, no problem. Thanks!

a-torgovitsky commented 4 years ago

I'm not sure if this is working. For example, I would expect this would work:

library("lpinfer")
source("~/R/x86_64-pc-linux-gnu-library/3.6/lpinfer/example/dgp_mixedlogit.R")

set.seed(5)
dgp <- mixedlogit_dgp()
df <- mixedlogit_draw(dgp, n = 4000)

lpm <- lpmodel(A.obs = mixedlogit_Aobs(dgp),
               beta.obs = function(d) mixedlogit_betaobs(d, dgp),
               A.shp = rep(1, nrow(dgp$vdist)),
               beta.shp = 1,
               A.tgt = mixedlogit_Atgt_dfelast(dgp, w2eval = 1, eeval = -1))

r <- invertci(f = fsst, farg = list(lpmodel = lpm, data = df))
r <- invertci(f = fsst, farg = list(lpmodel = lpm, data = df), alpha = .10,
              pvals = r$pvals)

but I get

Error: The column names of the data frame 'pvals' need to contain 'point'                      
             and 'p-value'.
conroylau commented 4 years ago

Done! I have fixed a bug when the code is checking the input of pvals. The confidence interval for the first r is [0.2139, 0.77786]. The confidence interval for the second r is [0.21866, 0.76779]. (If I put set.seed(5) before the second r, it gives [0.22623, 0.77493]). Thanks!

a-torgovitsky commented 3 years ago

Ok, now it certainly works.

Can you explain what is going on inside? The display looks the same but I'm not sure what's happening computationally. Are points being skipped if the pval is already known?

conroylau commented 3 years ago

Yes, the computation will be skipped if pval is already known for the given parameters. For instance, the bisection in the second r will be reusing the p-values from pvals. But we still need to set the seed here to make sure the same lambda is used so that the p-values will be recycled.

set.seed(12345)
r <- invertci(f = fsst, farg = list(lpmodel = lpm, data = df))
set.seed(12345)
r <- invertci(f = fsst, farg = list(lpmodel = lpm, data = df), pvals = r$pvals)

Thanks!

a-torgovitsky commented 3 years ago

I see, my mistake forgetting to set the seed. Thanks for the clarification!

conroylau commented 3 years ago

No problem!