OpenSourceEcon / OGstoch-inc

Deterministic overlapping generations model with idiosyncratic stochastic income

Variable and function naming conventions and modularity #3

Open · rickecon opened this issue 7 years ago

rickecon commented 7 years ago

We should discuss naming convention of variables and how we want to structure the modularity of our functions as well as modules. @jorgebarro uses a convention of long descriptive names, whereas @jdebacker and @rickecon often use a shorter naming convention that is tied to the theoretical representation of the model.

Example: beta vs. discountFactor.

Also, we should discuss how to optimally divide up functions into Python modules as well as how to divide up functionality by functions.

jorgebarro commented 7 years ago

Regarding variable-naming, I propose that we choose longer, more descriptive names for at least two reasons. First, since we are open-sourcing the code, we want it to be as legible as possible to reduce the costs of learning the code. Second, with more descriptive names, we can speed up the collaborative development process by avoiding variable description scavenger hunts.

jdebacker commented 7 years ago

The approach I like best is to match the variable names in the code to the names used in the write up of the model.

So if we have our model documentation with the FOC: u'(c_{s}) = \beta (1 + r) u'(c_{s+1}) (GitHub's markdown needs to handle TeX!), I like the code to be mu_c = beta * (1 + r) * mu_cp1 (or similar). In that way, the model equation maps directly to the code.
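
For concreteness, here is a minimal sketch of that mapping (purely illustrative, not code from this repo; the CRRA form of u'(c) and the variable shapes are my assumptions):

    import numpy as np

    def euler_errors(cvec, beta, r, sigma):
        # Illustrative only: residuals of u'(c_s) = beta * (1 + r) * u'(c_{s+1})
        # assuming CRRA utility, so u'(c) = c ** (-sigma).
        # cvec is an (S,) array of consumption by age.
        mu_c = cvec[:-1] ** (-sigma)    # u'(c_s) for s = 1, ..., S-1
        mu_cp1 = cvec[1:] ** (-sigma)   # u'(c_{s+1}) for s = 2, ..., S
        return mu_c - beta * (1 + r) * mu_cp1  # zero at the optimum

    errors = euler_errors(np.array([1.0, 1.05, 1.1]), beta=0.96, r=0.05,
                          sigma=2.0)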

Of course there are many variables used in the computational solution that are not described in the characterizing equations and so there is more leeway with those under this approach.

But I've certainly favored shorter names to make the code concise. We are writing detailed docstrings for each function that describe each object used in that function, so I don't think the meaning is lost (or at least discovering it is not far away). In fact, I've liked names even shorter than @rickecon usually does. E.g., he often uses names like bvec to denote a vector of the variable b and bmat in other places when working with a matrix of b. I prefer to stick more closely to the theoretical description of the equations and just use b everywhere, letting the reader of the code learn the size of the b object through the docstring for that function.

jorgebarro commented 7 years ago

I agree that we should strive to develop code that is as concise as possible. That said, we can write concise code without cutting corners on variable names. After reading several articles on best practices in variable naming (and reflecting on past experiences), I'm still convinced that we should choose longer, more descriptive variable names, with some limitations. From what I read, we should choose variable names that describe exactly what the variable does, using words that describe it as concisely as possible. But we can choose a shorter naming convention for object or tuple names, since they tend to be part of a longer sequence of characters. For example, I would favor params over parameters to name the container.

In economics, we tend to choose variable names as they appear in the paper (and I am a recovering offender). There are at least two problems with trying to map variables in the code to how they appear on paper. First, as @jdebacker mentioned, the paper representation of the model contains only a subset of the variables in the computer code. This means we'd end up with some combination of paper-model language and computer-code language, leading to inconsistency. Second, there is generally not a clear mapping between variables on paper and variables in code. For example, "u'(c)" on paper was "mu_c" in code. This is a simple example that becomes more confusing with more complex calculations, like second derivatives.
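
To make the contrast concrete, here is a hypothetical side-by-side of the two conventions for the same marginal-utility calculation (the descriptive names are my own suggestions, not an agreed standard):

    sigma = crra = 2.0       # coefficient of relative risk aversion
    c = consumption = 1.5    # a consumption level

    # Paper-style names: the reader needs a key to know that mu_c means
    # u'(c) and mu2_c means u''(c)
    mu_c = c ** (-sigma)
    mu2_c = -sigma * c ** (-sigma - 1)

    # Descriptive names: the meaning travels with the name
    marg_util_consumption = consumption ** (-crra)
    deriv_marg_util_consumption = -crra * consumption ** (-crra - 1)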

I know it's an adjustment to write long, descriptive variable names, but in a collaborative environment, I think it's critically important. It can speed up the development process and improve maintainability. Instead of linking articles, I suggest researching best practices in variable-naming and giving your thoughts on it.

jdebacker commented 7 years ago

@jorgebarro, these are all fair points. Do you think this advice might differ for Python, where statements end at the newline rather than with an explicit end-of-line character, and where PEP8 advises a 79-character line limit (72 for docstrings and comments)? This might make longer, more descriptive variable names harder to accommodate than in some other languages.

As an example of econ work in Python, consider the Quant Econ code which follows a variable naming convention along the lines of what I am arguing in favor of. For example, see this code on the endogenous grid method.

jorgebarro commented 7 years ago

@jdebacker, I agree with limiting the length of a line, but I also have no problem with splitting equations into multiple lines of code. Without end-of-line characters, we can still provide visual cues, like aligning all right-hand sides of an equation, or aligning the column elements of a matrix. See the "Indentation" section of the PEP8 guide. By the way, I think the PEP8 guide is an excellent reference for generic naming conventions (i.e., class and function names). We can develop an econ-specific naming convention, like lower-case for individual values and upper-case for aggregate values.
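
As a small sketch of those visual cues (the names and values here are hypothetical), a long equation can be split with the continuation aligned under the opening delimiter, and a matrix can be written so its columns line up across rows:

    import numpy as np

    beta, r, sigma = 0.96, 0.05, 2.0
    c_s, c_sp1 = 1.5, 1.6

    # A long equation split across lines, with the continuation aligned
    # under the opening delimiter (PEP8 "Indentation")
    euler_error = (c_s ** (-sigma)
                   - beta * (1 + r) * c_sp1 ** (-sigma))

    # A matrix written so that column elements line up across rows
    income_trans_mat = np.array([[0.90, 0.08, 0.02],
                                 [0.10, 0.80, 0.10],
                                 [0.02, 0.08, 0.90]])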

Regarding the Quant Econ code, I think they missed an opportunity to write more legible code. But even in the case of the endogenous grid code, they present it in the context of a very simple neoclassical growth model. If we want to develop a large-scale model with several novel contributions, the conventional notation of the standard model will cover a shrinking share of the variables with each extension of the model. For example, using "f" for the production function is not a problem in the basic model. However, in a large-scale model, "f" may become unclear in the presence of "g" and "h" functions.

If our goal is outreach, then the code should be optimized for legibility. I was convinced to code in Python for maximum outreach, so if legibility is a constraint on the programming language, we should take a step back and reconsider which language to use. Even if we decide to stick with Python for the development of a large-scale model, I still believe we should reduce the burden on new users and improve the collaborative process by choosing more descriptive names.

rickecon commented 7 years ago

@jorgebarro and @jdebacker. I propose that we move forward with a hybrid of these two approaches. Here are some guiding principles:

  1. Code should reflect the variable names used in the official documentation. Putting this principle first implies that documentation exists and should exist. An example of documentation for a deterministic OG model is here. This could be interpreted as using beta for the concept of a discount factor because the Greek letter "beta" is used in the documentation. This also suggests that the documentation (not the code) is the primary place to learn about what the code does. The code is merely an execution of the model in the documentation. The documentation is a necessary condition for code that is fully ready for public consumption.
  2. Variable names that are not obviously tied to items in the data, and for which ambiguous assignment may be an issue, are recommended to have longer, descriptive camelCase names.
  3. Regardless of how descriptive an object's name is in the code, each function should have a docstring that describes the inputs, function dependencies, new objects, new files created, and return objects of the function. An example of a function with accompanying docstring is the following. Note that there are only 15 lines of code, but many more lines of docstring.

    def get_n_errors(nvec, *args):
        '''
        --------------------------------------------------------------------
        Generates vector of static Euler errors that characterize the
        optimal lifetime labor supply decision. Because this function is
        used for solving for lifetime decisions in both the steady-state
        and in the transition path, lifetimes will be of varying length.
        Lifetimes in the steady-state will be S periods. Lifetimes in the
        transition path will be p in [2, S] periods
        --------------------------------------------------------------------
        INPUTS:
        nvec = (p,) vector, distribution of labor supply by age n_p
        args = either length 8 tuple or length 10 tuple, is (w, sigma,
               l_tilde, chi_n_vec, b_ellip, upsilon, diff, cvec) in most
               cases. In the time path for the age-S individuals in period
               1, the tuple is (wpath, sigma, l_tilde, chi_n_vec, b_ellip,
               upsilon, diff, rpath, b_s_vec, b_sp1_vec)

        OTHER FUNCTIONS AND FILES CALLED BY THIS FUNCTION:
            MU_c_stitch()
            MDU_n_stitch()

        OBJECTS CREATED WITHIN FUNCTION:
        w           = scalar > 0 or (p,) vector, steady-state wage or time
                      path of wage
        sigma       = scalar > 0, coefficient of relative risk aversion
        l_tilde     = scalar > 0, time endowment of each agent in each
                      period
        chi_n_vec   = (p,) vector, values for chi^n_p
        b_ellip     = scalar > 0, fitted value of b for elliptical
                      disutility of labor
        upsilon     = scalar > 1, fitted value of upsilon for elliptical
                      disutility of labor
        diff        = Boolean, =True if use simple difference Euler errors.
                      Use percent difference errors otherwise
        cvec        = (p,) vector, distribution of consumption by age c_p
        mu_c        = (p,) vector, marginal utility of consumption
        mdun_params = length 3 tuple, (l_tilde, b_ellip, upsilon)
        mdu_n       = (p,) vector, marginal disutility of labor supply
        n_errors    = (p,) vector, Euler errors characterizing optimal
                      labor supply nvec

        FILES CREATED BY THIS FUNCTION: None

        RETURNS: n_errors
        --------------------------------------------------------------------
        '''
        if len(args) == 8:
            # Steady-state, and almost all of TPI solutions pass in cvec.
            # args is length 8
            (wpath, sigma, l_tilde, chi_n_vec, b_ellip, upsilon, diff,
                cvec) = args
        else:
            # Solving for n_{S,1} in TPI does not pass in cvec, but passes
            # in rpath, b_s_vec, and b_sp1_vec instead. args is length 10
            (wpath, sigma, l_tilde, chi_n_vec, b_ellip, upsilon, diff,
                rpath, b_s_vec, b_sp1_vec) = args
            cvec = get_cons(rpath, wpath, b_s_vec, b_sp1_vec, nvec)

        mu_c = MU_c_stitch(cvec, sigma)
        mdun_params = (l_tilde, b_ellip, upsilon)
        mdu_n = MDU_n_stitch(nvec, mdun_params)
        if diff:
            n_errors = (wpath * mu_c) - chi_n_vec * mdu_n
        else:
            n_errors = ((wpath * mu_c) / (chi_n_vec * mdu_n)) - 1

        return n_errors
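
For reference, a hedged sketch of how an error function with this signature is typically handed to a root finder (the guess and parameter values below are hypothetical, and it assumes the same module defines MU_c_stitch(), MDU_n_stitch(), and get_cons() as the docstring indicates):

    import numpy as np
    import scipy.optimize as opt

    nvec_guess = 0.5 * np.ones(3)
    cvec = np.array([1.0, 1.1, 1.2])
    chi_n_vec = np.ones(3)
    # (w, sigma, l_tilde, chi_n_vec, b_ellip, upsilon, diff, cvec)
    n_args = (1.2, 2.0, 1.0, chi_n_vec, 0.5, 2.0, True, cvec)

    # scipy.optimize.root calls get_n_errors(x, *n_args), so the function's
    # *args tuple has length 8 and the first branch above is taken
    results = opt.root(get_n_errors, nvec_guess, args=n_args)
    nvec_solution = results.x
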
jorgebarro commented 7 years ago

@rickecon, first of all, I totally agree with leading the development of the computational model with the written model. That's a great point, and I think we should always prioritize the written document. That said, I still think that avoiding Greek letters and accents (like "tilde" and "hat") altogether would keep us from constantly having to trace each symbol back to its origin. For example, in the code above, I would choose time_endowment instead of l_tilde, crra instead of sigma, and consumptions instead of cvec. They're only a little longer but far more descriptive.

The idea is that we are anticipating the development of a large-scale model, which means that maintaining the legibility of the code will become increasingly difficult. In the code above, the naming convention is not a big deal because the code is brief. But imagine trying to make sense of a variable when we have thousands of lines of code. In my recent experience, when we tried the Greek naming convention in a large-scale collaborative model, the code quickly became illegible and slowed down our progress.

To be clear, I believe that we should strive for short names, as long as they are descriptive. For example, consider transition probabilities, which are often represented by the Greek letter pi. transition_probabilities is very long, so we could use trans_probs instead. However, pi is not descriptive at all, so I would avoid it.
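
A tiny hypothetical illustration of that trade-off (the matrix values are made up):

    import numpy as np

    # Greek-letter name: not descriptive on its own
    pi = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

    # Fully spelled out: descriptive, but long to read and type inline
    transition_probabilities = pi

    # Compromise: short but still self-describing
    trans_probs = pi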

We need to move forward, so if anyone still has concerns, I think we should have a conference call and close this issue with one last comment summarizing the call.

rickecon commented 7 years ago

@jorgebarro, @jdebacker. Here is a first shot at a modules/functions diagram (attached as an image below). I think these are the functions that we want. I think these functions follow the principle of DRY (don't repeat yourself), with the exception of the get_SS_dist() and get_TP_dist() functions in aggregates.py. I don't know right now how to write the steady-state distribution solving function in the same format as the time path distribution solving function. [This function is available on Google Drive and I can share it if you want it.]

[modules/functions diagram image]