Organize everything around an `lpmodel` object

a-torgovitsky commented 4 years ago

Probably we want to use an S3 object for this, but I am not that familiar with the pro's and con's of R's object system. I think most things are S3 objects, so that is why I suspect this is the way to go.

An lpmodel is an object that can have the following properties:

An A.obs matrix
A A.shp matrix
An A.tgt matrix
A beta.shp vector
A beta.tgt vector
A beta.obs vector
A function func.obs that takes can take a dataframe as an input
- This function always returns a beta.obs vector
- It optionally also returns an asymptotic variance-covariance matrix associated with beta.obs
- Give the user flexibility in how they specify the function --- just make sure it returns one vector and one matrix, but don't force them to use specific naming schemes

--> Note that we are getting rid of inequalities as per issue #20 .

Depending on the routine being run, it might be ok for some or many of these fields to be empty, which is fine.

For example, estbounds can take either beta.obs or func.obs. If func.obs, then the user needs to pass data too. Also, estbounds does not need beta.tgt.

As another example, dkqs doesn't need A.shp or beta.shp, as the only deterministic constraint that the DKQS procedure allows for (other than the bound constraint) is the simplex constraint. (However, if they aren't empty, then make sure to check that the user actually passed the simplex constraint --- otherwise it is a sign they are confused.)

The envisioned workflow will then be something like:

User instantiates an lpmodel
The user passes this object to some combination of estbounds and/or one of the testing or confidence region procedures.

a-torgovitsky commented 4 years ago

Also, I should be able to call print on an lpmodel and get back some nice display that shows me what the underlying LP looks like in some informative way. I'm not exactly sure what the most informative way of doing this is, but take a stab at it and we will iterate a few times.

a-torgovitsky commented 4 years ago

One other thing here:

A beta.obs vector A function func.obs that takes can take a dataframe as an input This function always returns a beta.obs vector It optionally also returns an asymptotic variance-covariance matrix associated with beta.obs Give the user flexibility in how they specify the function --- just make sure it returns one vector and one matrix, but don't force them to use specific naming schemes

Can we combine this all into beta.obs? So that beta.obs is either:

A vector
A function that takes a dataframe and returns a vector
A function that takes a dataframe and returns both a vector and a variance-covariance matrix.

And maybe also give the user the option to pass bootstrap draws directly, so that beta.obs could also be a list of vectors (or a matrix), with the first one interpreted as the sample estimate, and the following ones interpreted as bootstrap draws of that estimate.

a-torgovitsky commented 4 years ago

Also, I guess it doesn't make much sense to have beta.tgt part of the lpmodel object, since beta.tgt is a specific value we want to test, and not a part of the model per se.

conroylau commented 4 years ago

Sure, I will separate beta.tgt out. I agree that it is better since we are testing beta.tgt so separating it from lpmodel can prevent us from changing lpmodel every time in case we are testing several beta.tgt.

Regarding beta.obs, I am thinking whether it is ideal to make it either a vector (that comes from passing data frame to the function) or a function. For instance, func.obs is needed inside the bootstrap procedure of the dkqs function. Thus, I am thinking if we should keep func.obs or beta.obs separate, or we can require beta.obs to be a function inside procedures like dkqs.

To pass the bootstrap draws directly, may I know does it mean that beta.obs be a list, where the first element is the point estimate, and the second element is perhaps a data frame of the bootstrap betas?

Thank you!

a-torgovitsky commented 4 years ago

or we can require beta.obs to be a function inside procedures like dkqs.

That is preferable, since there's not really any conceptual distinction between beta.obs as a vector or as a function.

To pass the bootstrap draws directly, may I know does it mean that beta.obs be a list, where the first element is the point estimate, and the second element is perhaps a data frame of the bootstrap betas?

I could see many ways to do this:

A list of vectors
A matrix of vectors
A list with the first element a vector and the second element a data.frame
A data.frame with the first row the point estimate, and the remaining B rows the bootstraps.

You should decide which is the easiest. My guess is that the single data.frame is the easiest. Although a matrix is probably sufficient, not sure why we would need a data.frame in this case.

conroylau commented 4 years ago

I see, sure, I would make beta.obs into an argument that can be either a function and/or matrix, depending the procedure used.

I agree that a matrix is sufficient for beta.obs. I think the 4th method by writing the first row as a point estimate and the remaining B rows as the bootstraps will be the easiest to implement.

Am I correct that if the user pass a matrix to beta.obs (more than 1 row), then the bootstrap procedure is skipped automatically if the matrix provided has B + 1 rows? If it has fewer than B + 1 rows, then the procedure will return an error, indicating that the arguments are inconsistent.

Thanks!

a-torgovitsky commented 4 years ago

Actually, I think we should more generally allow for all of these to be functions of the data: A.obs, A.shp, A.tgt, beta.obs, beta.shp

In DKQS only beta.obs can be stochastic, but in general we might consider procedures where other components can also be stochastic. (Actually, we could do this in subsampling --- I will make an issue about that next.)

Given that, I think we want three options for each component:

A single matrix/vector
A function that maps data to a matrix/vector
A list of matrix/vectors (so, using the list function).

Does that make sense?

conroylau commented 4 years ago

Yes, this and issue #23 makes sense.

For the vectors beta.obs and beta.shp, should I still allow them to be a matrix of B + 1 rows if the user would like to pass the bootstrap or re-sample versions of the betas? Or would it be better if I make them to be consistent with A.obs, A.shp and A.tgt that they would need to pass a list of vector if they want to pass the bootstrap or re-sample versions of the betas?

Thanks!

a-torgovitsky commented 4 years ago

I think it's better if you make them consistent and just have lists.

However, one thing to check before we make a decision is what the memory usage is of lists vs. matrices. Could you check that? That is, if I have 50 vectors of doubles arranged in a list, is that more memory than having those 50 vectors arranged into a single matrix?

conroylau commented 4 years ago

Yes, for the same amount of information, list uses more memory than matrices. For instance, in your example, a list has 10,848 byes and a matrix has 8,216 bytes if there are 50 vectors with 20 elements each, as generated by the code below.

set.seed(1)
N = 50
mat <- matrix(runif(N*20), nrow = N, byrow = TRUE)

lis <- NULL
set.seed(1)
for (i in 1:N){
  lis[[i]] = runif(20)
}

object.size(mat)
object.size(lis)

a-torgovitsky commented 4 years ago

That doesn't seem like too much of a difference. I think it's worth it to use the list since it will be more clear how the data should be input and is consistent for both vectors and matrices. Probably the memory usage on this dimension will not be important either.

conroylau commented 4 years ago

Done! I have updated the module by organizing the information around the lpmodel object. I have also updated the structure of the code and update some notations so that they are consistent across different files. I might further revise them when I start to write the code for the fsst procedure tomorrow.

Currently, the files in the R folder are organized as follows:

Tests/estimate bounds/construct confidence intervals: dkqs.R, subsample.R, invertci.R, estbounds.R, fsst.R
Functions to check the inputs from user: checks.R
LP/QP solvers: optim.R
lpmodel-related functions: lpmodel.R

I am also thinking how I should present the objects inside the lpmodel function.

Also, I should be able to call print on an lpmodel and get back some nice display that shows me what the underlying LP looks like in some informative way. I'm not exactly sure what the most informative way of doing this is, but take a stab at it and we will iterate a few times.

May I know would it be a good idea to print the following information of lpmodel for each of the objects A.obs, A.shp, A.tgt, beta.obs and beta.shp?

If it is a matrix or a data.frame, print its class and its dimension.
If it is a function, print its class.
If it is a list, print its class, the number of objects in the list and the dimension of the first element (print an error message if not all objects in the list have the same dimension).
If the class of more than one objects in lpmodel is list, print an error message if the number of objects in each list are different.

Thank you!

a-torgovitsky commented 4 years ago

That seems like good output for print.

However, the last two bullet points seem like they are also doing error-checking, right? So the error-checking component of those probably belongs somewhere else.

If it is a list, print its class, the number of objects in the list and the dimension of the first element (print an error message if not all objects in the list have the same dimension). If the class of more than one objects in lpmodel is list, print an error message if the number of objects in each list are different.

conroylau commented 4 years ago

I see, yes they are doing error checking. I agree they should belong to the error checking part, not in print. I am doing these checking in the beginning of the testing procedure, but I thought it might be useful to tell the user if they have specified lpmodel in the wrong way. I think I should modify the current check.lpmodel function (that is currently used inside the functions), so that user can call the function to do error checking if they want.

Regarding the part on list, would it be a good idea to revise the output as printing the dimension of the objects in the list and the number of objects for each dimension (if the user specify a list of functions, then the dimension is not printed)? If the user specifies all the objects in the list to have the same dimension, then this is the same as before. But if the user specifies a list with objects of different dimensions, I am thinking if this would be more informative than just printing the first element of the list.

Thanks!

a-torgovitsky commented 4 years ago

I think this is related to the idea of a constructor for a class. You should take a look at how others do object-oriented programming in R and see if there are any good ideas. Ideally, one would not be able to even specify an lpmodel object that is incoherent.

If we can assume that all lpmodel objects are coherent, then the second issue won't arise as we will know that any lpmodel passed to print (or any other function) will have objects of the same size in its list.

conroylau commented 4 years ago

I have just updated the print on lpmodel. A sample output looks like this:

Object     Class    Dimension   Length 
A.obs      matrix   11x22       1
A.shp      -empty-  -empty-     -empty-
A.tgt      matrix   1x22        1
beta.obs   function N/A     N/A
beta.shp   -empty-  -empty-     -empty-

May I know would the above output be okay? In addition, I am thinking would it be a good idea to give an optional data argument for print so that users can also check what would be the dimension for the output of the functions (in the example above, it would be the beta.obs function)?

Thanks!

a-torgovitsky commented 4 years ago

That output looks ok and what you propose sounds like a good idea.

conroylau commented 4 years ago

Updated!

With the following code:

library(lpinfer)
J = 11
yp <- seq(0, 1, 1/10)
A.obs <- cbind(matrix(rep(0,J*J), nrow = J), diag(1, J))
A.tgt <- matrix(c(yp, yp), nrow = 1)
func.full.info <- function(data){
  beta <- NULL
  y_list <- sort(unique(data[,"Y"]))
  n <- dim(data)[1]
  yn <- length(y_list)
  for (i in 1:yn){
    beta_i <- sum((data[,"Y"] == y_list[i]) * (data[,"D"] == 1))/n
    beta <- c(beta,c(beta_i))
  }
  beta <- as.matrix(beta)
  return(beta)
}

lpmodel <- list(A.obs = A.obs,
                A.tgt = A.tgt,
                beta.obs = func.full.info)
class(lpmodel) <- "lpmodel"

The command print(lpmodel) is giving

Object     Class    Dimension   Length 
A.obs      matrix   11x22       1
A.shp      -empty-  -empty-     -empty-
A.tgt      matrix   1x22        1
beta.obs   function N/A     N/A
beta.shp   -empty-  -empty-     -empty-

and the command print(lpmodel, data) is giving

Object     Class    Dimension   Length 
A.obs      matrix   11x22       1
A.shp      -empty-  -empty-     -empty-
A.tgt      matrix   1x22        1
beta.obs   function 11x1        1
beta.shp   -empty-  -empty-     -empty-

Thank you!

a-torgovitsky commented 4 years ago

Looks good!

Is there a way to generate an lpmodel object in one step? So instead of creating a list, then assigning it a class (two steps)?

In other languages I would think something like this would work:

lpm <- lpmodel(A.obs = A.obs,
                        A.tgt = A.tgt,
                        beta.obs = func.full.info)

conroylau commented 4 years ago

Done! Sure, I added a function called lpmodel that takes five arguments: A.obs, A.shp, A.tgt, beta.obs and beta.tgt (their default are NULL) and set lpmodel as a list with these five objects with class lpmodel. Thanks!

a-torgovitsky commented 4 years ago

Ok great. Is that the ways these things are usually done? Can you check? For example, what about data.frame? If I run data.frame(a = 1:5) does R construct a list in the background and set the class type?

conroylau commented 4 years ago

Yes, I think this is what they do for data.frame. From my understanding, data.frame takes the input (... in the function argument) and then assign it as a list via the command x <- list(...). Then, it assigns the class as "data.frame" via the command class(value) <- "data.frame", where value is the object returned in the data.frame function.

I also looked at a few other functions like matrix, date to see if they are doing something similar but it seems like these functions are calling C codes, so they did the things differently.

a-torgovitsky commented 4 years ago

Ok very good, let's close this for now at least

conroylau / lpinfer

Organize everything around an `lpmodel` object #21