alashworth / test-issue-import

0 stars 0 forks source link

Distinguishing between data and transformed data #119

Open alashworth opened 5 years ago

alashworth commented 5 years ago

Issue by bgoodri Sunday Dec 11, 2016 at 07:47 GMT Originally opened as https://github.com/stan-dev/stan/issues/2165


Summary:

The generated C++ code should have a way to distinguish between private members that were declared in the data block and private members that were declared in the transformed data block.

Description:

From just looking at (or re-parsing) the C++ code, it is not possible to tell what is data and what is transformed data. This is a limitation because interfaces cannot easily infer what Stan is expecting to be passed and, e.g., error out before compiling if something from data is not supplied. In the near future, I would like to write a StanProgramWithData reference class that has fields for the symbols in the data block of the Stan program.

We could fix this ambiguity because according to http://www.cplusplus.com/doc/tutorial/classes/

By default, all members of a class declared with the class keyword have private access for all its members. Therefore, any member that is declared before any other access specifier has private access automatically.

So, if things in data were declared before the private: access specifier and things in transformed data were declared after, I could scan the C++ file up to private:.

Reproducible Steps:

Parse

data {
  int<lower=0> J; // number of schools 
  real y[J]; // estimated treatment effects
  real<lower=0> sigma[J]; // s.e. of effect estimates 
}
transformed data {
  real<lower=0> sqrt_J;
  sqrt_J = sqrt(J);
}
parameters {
  real mu; 
  real<lower=0> tau;
  real eta[J];
}
transformed parameters {
  real theta[J];
  for (j in 1:J)
    theta[j] <- mu + tau * eta[j];
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}

Current Output:

The C++ code looks like

class model4c5b418d36dd_8schools : public prob_grad {
private:
    int J;
    vector<double> y;
    vector<double> sigma;
    double sqrt_J;
public:
...

Expected Output:

class model4c5b418d36dd_8schools : public prob_grad {
    int J;
    vector<double> y;
    vector<double> sigma;
private:
    double sqrt_J;
public:
...

Additional Information:

Current Version:

v2.13.1

alashworth commented 5 years ago

Comment by bob-carpenter Monday Dec 12, 2016 at 17:00 GMT


I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

alashworth commented 5 years ago

Comment by bob-carpenter Monday Dec 12, 2016 at 17:01 GMT


I'm just about to commit a refactor that should make the AST code much easier to follow. It was one monolothic file with no organization and now it's broken down into directories of about 100 files with one class or function per file.

alashworth commented 5 years ago

Comment by bgoodri Monday Dec 12, 2016 at 18:45 GMT


Static methods would be fine, and as long as we are adding them (soon) then there should be ones for parameters, transformed parameters, and generated quantities that also say what the dimensions are. I suppose that could be done from the AST but that currently is not accessible from the interfaces.

On Mon, Dec 12, 2016 at 12:00 PM, Bob Carpenter notifications@github.com wrote:

I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266486741, or mute the thread https://github.com/notifications/unsubscribe-auth/ADOrqjR8skTnArkPPp09FlUvECHCCVw4ks5rHX24gaJpZM4LJ3Vo .

alashworth commented 5 years ago

Comment by bob-carpenter Monday Dec 12, 2016 at 21:32 GMT


I can add this pretty quickly. What exactly do you need?

Is it enough to add the following five static functions?

static std::vector data_variables(); ... transformed_data_variables(); ... parameters_variables(); ... transformed_parameters_variables(); ... generated_quantities_variables();

Or is there a way where it'd be easier to get. If you need sizes, it can't be static and in fact the sizes can depend on values in earlier variable blocks (e.g., you can define an integer in transformed data that determines the size of a parameter vector).

On Dec 12, 2016, at 1:45 PM, bgoodri notifications@github.com wrote:

Static methods would be fine, and as long as we are adding them (soon) then there should be ones for parameters, transformed parameters, and generated quantities that also say what the dimensions are. I suppose that could be done from the AST but that currently is not accessible from the interfaces.

On Mon, Dec 12, 2016 at 12:00 PM, Bob Carpenter notifications@github.com wrote:

I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266486741, or mute the thread https://github.com/notifications/unsubscribe-auth/ADOrqjR8skTnArkPPp09FlUvECHCCVw4ks5rHX24gaJpZM4LJ3Vo .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

alashworth commented 5 years ago

Comment by bgoodri Monday Dec 12, 2016 at 21:39 GMT


Those five functions would be a step in the right direction. Currently, there is a method called get_num_upars that returns the number of unconstrained parameters after the model has be instantiated. Can't we have a similar function like

std::vector get_dims(std::string name)

that returns the dimensions of the given variable?

On Mon, Dec 12, 2016 at 4:32 PM, Bob Carpenter notifications@github.com wrote:

I can add this pretty quickly. What exactly do you need?

Is it enough to add the following five static functions?

static std::vector data_variables(); ... transformed_data_variables(); ... parameters_variables(); ... transformed_parameters_variables(); ... generated_quantities_variables();

Or is there a way where it'd be easier to get. If you need sizes, it can't be static and in fact the sizes can depend on values in earlier variable blocks (e.g., you can define an integer in transformed data that determines the size of a parameter vector).

  • Bob

On Dec 12, 2016, at 1:45 PM, bgoodri notifications@github.com wrote:

Static methods would be fine, and as long as we are adding them (soon) then there should be ones for parameters, transformed parameters, and generated quantities that also say what the dimensions are. I suppose that could be done from the AST but that currently is not accessible from the interfaces.

On Mon, Dec 12, 2016 at 12:00 PM, Bob Carpenter < notifications@github.com> wrote:

I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266486741, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADOrqjR8skTnArkPPp09FlUvECHCCVw4ks5rHX24gaJpZM4LJ3Vo .

— You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266559225, or mute the thread https://github.com/notifications/unsubscribe-auth/ADOrqtNMSSWuPYRFbFehilxBnL1dj2PMks5rHb1ygaJpZM4LJ3Vo .

alashworth commented 5 years ago

Comment by bob-carpenter Monday Dec 12, 2016 at 21:46 GMT


Yes, but it can't be static. We can return the number of dimensions or even something representing the shape (int | real | vector | row_vector | matrix) x array-dims. But the sizes aren't known until the data is read in.

On Dec 12, 2016, at 4:39 PM, bgoodri notifications@github.com wrote:

Those five functions would be a step in the right direction. Currently, there is a method called get_num_upars that returns the number of unconstrained parameters after the model has be instantiated. Can't we have a similar function like

std::vector get_dims(std::string name)

that returns the dimensions of the given variable?

On Mon, Dec 12, 2016 at 4:32 PM, Bob Carpenter notifications@github.com wrote:

I can add this pretty quickly. What exactly do you need?

Is it enough to add the following five static functions?

static std::vector data_variables(); ... transformed_data_variables(); ... parameters_variables(); ... transformed_parameters_variables(); ... generated_quantities_variables();

Or is there a way where it'd be easier to get. If you need sizes, it can't be static and in fact the sizes can depend on values in earlier variable blocks (e.g., you can define an integer in transformed data that determines the size of a parameter vector).

  • Bob

On Dec 12, 2016, at 1:45 PM, bgoodri notifications@github.com wrote:

Static methods would be fine, and as long as we are adding them (soon) then there should be ones for parameters, transformed parameters, and generated quantities that also say what the dimensions are. I suppose that could be done from the AST but that currently is not accessible from the interfaces.

On Mon, Dec 12, 2016 at 12:00 PM, Bob Carpenter < notifications@github.com> wrote:

I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266486741, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADOrqjR8skTnArkPPp09FlUvECHCCVw4ks5rHX24gaJpZM4LJ3Vo .

— You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266559225, or mute the thread https://github.com/notifications/unsubscribe-auth/ADOrqtNMSSWuPYRFbFehilxBnL1dj2PMks5rHb1ygaJpZM4LJ3Vo .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

alashworth commented 5 years ago

Comment by bgoodri Monday Dec 12, 2016 at 21:51 GMT


Right. So, maybe we need static methods for the types, static methods for what block they are from, and non-static methods for dimensions.

On Mon, Dec 12, 2016 at 4:46 PM, Bob Carpenter notifications@github.com wrote:

Yes, but it can't be static. We can return the number of dimensions or even something representing the shape (int | real | vector | row_vector | matrix) x array-dims. But the sizes aren't known until the data is read in.

  • Bob

On Dec 12, 2016, at 4:39 PM, bgoodri notifications@github.com wrote:

Those five functions would be a step in the right direction. Currently, there is a method called get_num_upars that returns the number of unconstrained parameters after the model has be instantiated. Can't we have a similar function like

std::vector get_dims(std::string name)

that returns the dimensions of the given variable?

On Mon, Dec 12, 2016 at 4:32 PM, Bob Carpenter <notifications@github.com

wrote:

I can add this pretty quickly. What exactly do you need?

Is it enough to add the following five static functions?

static std::vector data_variables(); ... transformed_data_variables(); ... parameters_variables(); ... transformed_parameters_variables(); ... generated_quantities_variables();

Or is there a way where it'd be easier to get. If you need sizes, it can't be static and in fact the sizes can depend on values in earlier variable blocks (e.g., you can define an integer in transformed data that determines the size of a parameter vector).

  • Bob

On Dec 12, 2016, at 1:45 PM, bgoodri notifications@github.com wrote:

Static methods would be fine, and as long as we are adding them (soon) then there should be ones for parameters, transformed parameters, and generated quantities that also say what the dimensions are. I suppose that could be done from the AST but that currently is not accessible from the interfaces.

On Mon, Dec 12, 2016 at 12:00 PM, Bob Carpenter < notifications@github.com> wrote:

I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment- 266486741, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADOrqjR8skTnArkPPp09FlUvECHCCVw4ks5rHX24gaJpZM4LJ3Vo .

— You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266559225, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADOrqtNMSSWuPYRFbFehilxBnL1dj2PMks5rHb1ygaJpZM4LJ3Vo

.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266562754, or mute the thread https://github.com/notifications/unsubscribe-auth/ADOrqmbfe8LzG38OvBO0EKnMwmqqlTanks5rHcCegaJpZM4LJ3Vo .

alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Dec 13, 2016 at 19:50 GMT


This is all relatively easy with the types we have now, but is going to become horrifically complicated when we add sparse types.

Right now, the run-time type of an object is just this:

base_type: (int | real | vector | row_vector | matrix) num_dims: integer

The basic shape is just integer vs. real plus number of total dimensions.

What exactly do you want out? The constraints come in the form of lower and upper bounds on basic types, plus more strucutred constraints like simplex or cov_matrix. Those are also available statically, so they could be a string indicating which type of constraint is involved:

(none, lb, ub, lub, simplex, ordered, positive_ordered, unit_vector, cov_matrix, corr_matrix, cholesky_corr, cholesky_cov)

The values of the constraint, like the actual lower and upper bounds, are not available until runtime.

Statically, I could write a function that returns a bunch of triples of (string, base_type, num_dims) for each block. Or I could just return (int | real) for base type and aggregate number of dims, which determines the shape (but not size) the inputs have to be.

It's much more complicated to figure out sizes at run time because the sizes can be functions. So you have to actually execute the expressions in the Stan program declaration blocks to get the sizes. But once you get the data, the sizes are fixed.

On Dec 12, 2016, at 4:51 PM, bgoodri notifications@github.com wrote:

Right. So, maybe we need static methods for the types, static methods for what block they are from, and non-static methods for dimensions.

On Mon, Dec 12, 2016 at 4:46 PM, Bob Carpenter notifications@github.com wrote:

Yes, but it can't be static. We can return the number of dimensions or even something representing the shape (int | real | vector | row_vector | matrix) x array-dims. But the sizes aren't known until the data is read in.

  • Bob

On Dec 12, 2016, at 4:39 PM, bgoodri notifications@github.com wrote:

Those five functions would be a step in the right direction. Currently, there is a method called get_num_upars that returns the number of unconstrained parameters after the model has be instantiated. Can't we have a similar function like

std::vector get_dims(std::string name)

that returns the dimensions of the given variable?

On Mon, Dec 12, 2016 at 4:32 PM, Bob Carpenter <notifications@github.com

wrote:

I can add this pretty quickly. What exactly do you need?

Is it enough to add the following five static functions?

static std::vector data_variables(); ... transformed_data_variables(); ... parameters_variables(); ... transformed_parameters_variables(); ... generated_quantities_variables();

Or is there a way where it'd be easier to get. If you need sizes, it can't be static and in fact the sizes can depend on values in earlier variable blocks (e.g., you can define an integer in transformed data that determines the size of a parameter vector).

  • Bob

On Dec 12, 2016, at 1:45 PM, bgoodri notifications@github.com wrote:

Static methods would be fine, and as long as we are adding them (soon) then there should be ones for parameters, transformed parameters, and generated quantities that also say what the dimensions are. I suppose that could be done from the AST but that currently is not accessible from the interfaces.

On Mon, Dec 12, 2016 at 12:00 PM, Bob Carpenter < notifications@github.com> wrote:

I'd much rather not do anything that involves scanning the C++ file, because it's a level of compatibility we don't want to support. What I think we should do for this is have static methods on the generated C++ that tell what the variables are. Or you should scan the program object generated in the AST from parsing the actual Stan program. The AST has two distinct structures from which you can recover all the types of all the variables in all the blocks.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment- 266486741, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADOrqjR8skTnArkPPp09FlUvECHCCVw4ks5rHX24gaJpZM4LJ3Vo .

— You are receiving this because you were assigned.

Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266559225, or mute the thread https://github.com/notifications/unsubscribe-auth/ ADOrqtNMSSWuPYRFbFehilxBnL1dj2PMks5rHb1ygaJpZM4LJ3Vo

.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2165#issuecomment-266562754, or mute the thread https://github.com/notifications/unsubscribe-auth/ADOrqmbfe8LzG38OvBO0EKnMwmqqlTanks5rHcCegaJpZM4LJ3Vo .

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub, or mute the thread.

alashworth commented 5 years ago

Comment by bgoodri Wednesday Dec 14, 2016 at 05:07 GMT


I think we need to be able to query as much information as possible, particularly about the things in the data block.

So, the first thing would be a function to get the names of each thing declared in each block (excluding the model block and local blocks). Once we have those names, we ought to be be able to input a name and output what looks to the user of the Stan language as a type (real, int, vector, row_vector, simplex, unit_vector, ordered, positive_ordered, matrix, cov_matrix, corr_matrix, cholesky_factor_cov, cholesky_factor_corr) even though most of those are not really types.

I'm not too worried about lower bounds or upper bounds on things in the data block because in order to know what they are, we would have to instantiate it with data, and if we can instantiate it, then the data much have satisfied the lower and / or upper bounds. The lower bounds and upper bounds are more important for parameter, even though the bounds might not be constants. Can we just output the realizations of the bounds at each iteration?

We are definitely going to need a function that inputs a name and outputs its dimensions so that we know how big to make the arrays that hold the draws.

For R, it would be fine to return a tuple of heterogeneous things, but I'm not sure how well that would work for other interfaces. So, it may be better to make the functions return homogenous things in the cases where there is more than one output.

alashworth commented 5 years ago

Comment by bob-carpenter Tuesday Dec 20, 2016 at 07:00 GMT


We have the names and dimensions for parameters, transformed parameters, and generated quantities. I could add similar functions for data and transformed data if that'd help.

There are already names and dimensions for parameters and transformed parameters, but only the shape, not the array/vector/matrix type, much less constraints. We could store the Stan runtime type (int/real/vector/row_vector/matrix + dimensions) or the full constrained type.

The big design question will be whether to expose the AST or go with something different. The AST has proper variant types enforcing runtime type safety, but at the cost of making everything a complicated callback. We could define something more string based if that'd be enough.

alashworth commented 5 years ago

Comment by bob-carpenter Monday Jul 31, 2017 at 17:56 GMT


superseded by #2360