Gkreindler / GMMTools.jl

MIT License

Thoughts on your package #2

Open pdeffebach opened 1 year ago

pdeffebach commented 1 year ago

Gabriel,

It was great meeting you last week. As promised, here are my thoughts on your package. Overall I think it is a very useful addition to the ecosystem and I intend to emulate it for estimating my job search model until this package is ready for production, in which case I will use it outright.

GMMTools.jl and DrWatson.jl

As an overall thought on this package, I'm not entirely sure what it wants to be. Is it a GLM.jl-like tool for performing optimization, or is it a DrWatson.jl-like tool for caching intermediate results?

I am not super familiar with DrWatson.jl, but reading the documentation, its main functionality overlaps a lot with yours. Most importantly, it caches the inputs to the estimation function to avoid unnecessarily re-running optimization. It also allows for parallel processing of analyses.

The caching and file-path handling take up a significant amount of code in your code base. To what extent could you write a more minimal estimation package that you confirm interacts nicely with DrWatson.jl?

My guess is that DrWatson.jl is useful for storing a lot of "final" estimation results, but is maybe not as useful for storing intermediate results within a single estimation, such as manually inspecting the CSV file that stores the optimal parameters from each initial condition when diagnosing odd estimates. Your package is also useful for wrapping individual moment-function calls inside try blocks so you don't get a failure after a week of computation.
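A minimal sketch of that try-block pattern (the wrapper name and failure value are illustrative, not the package's actual API):

```julia
# Hypothetical wrapper: evaluate a moment function, but never let one bad
# parameter draw abort a long-running estimation.
function safe_moments(m::Function, theta; fail_value = NaN)
    try
        return m(theta)
    catch err
        @warn "moment function failed; returning fail_value" theta err
        return fail_value
    end
end

# Usage: a toy moment function that errors for negative parameters
m(theta) = theta < 0 ? error("domain error") : theta^2

safe_moments(m, 2.0)   # 4.0
safe_moments(m, -1.0)  # NaN (logged, but the run continues)
```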

Either way, I wonder if you can delineate more clearly where your package ends and where DrWatson.jl begins, to avoid doing too much.

GMMTools.jl and GLM.jl

Your readme states:

> The ultimate aim is to offer a similar set of features that a typical OLS package adds above and beyond coding directly `(X'X)-1X'Y`.

Currently, your API is not set up to do that, in the sense that in res = run_estimation(...), res is not an object defined by you and does not offer a convenient set of methods for extracting useful information. There is no minimizer(res) (from Optim.jl) or stderror(res) (from StatsAPI.jl), etc.

I understand that you need to get the functionality down first, and that coding up all the options for gmm_options is a lot of work, but I actually think creating an Input struct and an Output struct would make the development process a lot easier, not just improve the user experience.

Starting with output, it might make sense to have separate output objects for each estimation method:

abstract type GMMOutput end

struct CMDOutput <: GMMOutput
    # Store the history of estimation
    # Store the history of bootstrapping
    # Store the best parameters
    # Store the weighting matrix
    # Store the jacobian
end

struct OneStepOutput <: GMMOutput
    ...
end

When developing, you can now focus on "what output do users want?" If they want the minimizer, write a function minimizer(res::CMDOutput), and iterate until all functionality is complete. (Obviously this is easier said than done, but it has helped me a lot to start with the API and work backwards from there.)
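Concretely, that accessor-first workflow might look like this (the struct and field names are invented for illustration, not GMMTools' actual layout):

```julia
using LinearAlgebra

abstract type GMMOutput end

# Hypothetical result type; fields are placeholders
struct CMDOutputSketch <: GMMOutput
    theta_hat::Vector{Float64}  # best parameters
    vcov::Matrix{Float64}       # variance-covariance matrix
end

# Small accessor API in the Optim.jl / StatsAPI.jl spirit
minimizer(res::CMDOutputSketch) = res.theta_hat
stderror(res::CMDOutputSketch)  = sqrt.(diag(res.vcov))

res = CMDOutputSketch([1.5, -0.3], [0.04 0.0; 0.0 0.01])
minimizer(res)  # [1.5, -0.3]
stderror(res)   # ≈ [0.2, 0.1]
```

The payoff is that users never touch the struct's internals, so the internal layout can change freely during development.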

I think the same can be said for the inputs. Currently you have a Dict to store everything, but with a Dict you lose powerful multiple dispatch. A lot of the logic in your code is currently spent checking whether certain conditions are met and then calculating accordingly. Multiple dispatch makes this a lot easier: your outer function can be short, and all the inner logic can be split into smaller functions and separate files. Something like:

abstract type GMMInput end

struct CMDInput <: GMMInput
    # Store the input data
    # Store the moment function
    # Store the omega matrix
    # Store the printing parameters
    # Store bootstrapping options
end

run_estimation(::CMDInput) = ...
run_estimation(::OneStepInput) = ...

This strategy is broadly what GLM.jl, FixedEffectModels.jl, and Econometrics.jl do.

Multiple dispatch would also make it easier to add new backends, like Optim.jl and GalacticOptim.jl: you can store the optimizer choice as a singleton struct like OptimizerOptim() or OptimizerGalacticOptim() and dispatch to the backend optimizer based on these values.
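A sketch of that singleton-dispatch pattern (names are illustrative; the backend calls are stubbed with strings to keep the example self-contained):

```julia
# Hypothetical backend markers
abstract type GMMBackend end
struct OptimizerOptim         <: GMMBackend end
struct OptimizerGalacticOptim <: GMMBackend end

# One short method per backend; real versions would call Optim.optimize
# or GalacticOptim's solver instead of returning a string.
run_backend(::OptimizerOptim, obj, theta0)         = "would call Optim.optimize(obj, theta0, ...)"
run_backend(::OptimizerGalacticOptim, obj, theta0) = "would call GalacticOptim solve(...)"

run_backend(OptimizerOptim(), x -> sum(abs2, x), [0.0, 0.0])
```

Adding a new backend is then one new singleton type plus one new method, with no if/else chains in the outer estimation function.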

Custom structs might also make saving intermediate inputs easier. You can define "serialization" and "deserialization" functions for each custom struct. This punts complicated "does this file exist?" logic out to a separate function that creates directories from the outset.
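For instance, using the Serialization standard library (the struct and function names here are hypothetical):

```julia
using Serialization  # Julia stdlib

# Hypothetical input struct, mirroring the CMDInput sketch above
struct CMDInputSketch
    data::Matrix{Float64}
    theta0::Vector{Float64}
end

# Per-struct save/load; directory creation is handled once, up front
function save_input(path::AbstractString, input::CMDInputSketch)
    mkpath(dirname(path))  # no scattered "does this directory exist?" checks
    serialize(path, input)
    return path
end

load_input(path::AbstractString) = deserialize(path)::CMDInputSketch

# Usage
input  = CMDInputSketch(randn(3, 2), [0.5, -0.5])
path   = save_input(joinpath(mktempdir(), "run1", "input.jls"), input)
loaded = load_input(path)
loaded.theta0 == input.theta0  # true
```

(Serialization's binary format is Julia-version-specific; for long-lived results a text format like CSV or JSON is safer, but the per-struct save/load interface stays the same.)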

Mathematical appendix

I took both the basic grad sequence as well as the "econometrics concentration" and macro sequence at BU, and even then neither covered what CMD means vs. GMM. I think it would be worth including some documentation that re-hashes Hansen in your preferred notation. It's something I wouldn't mind contributing to, since it would probably help me learn more about GMM.

I've included a pull request for documentation, so theoretically in the future there will be a place to put this stuff.

Miscellaneous

julia> ~5
-6

Anyways, those are some thoughts on your package.

I'm sorry they aren't more mathematical. But I did take a look at the implementation details and they look good to me. I could check them more carefully if there were a mathematical appendix.

Gkreindler commented 8 months ago

@pdeffebach, this was extremely useful! Thank you again for the careful look and all the feedback! I reply here for recordkeeping. Your further feedback is very welcome but not expected.

I did a complete re-write of the package over the past months (still in progress) along the lines you recommended.

Note: the package now runs "plain" GMM (the moment function returns an N x M matrix for N observations and M moments). I will add Classical Minimum Distance (CMD) in the coming months.

A few more responses:

> res is not an object defined by you and does not offer a convenient set of methods for extracting useful information. There is no minimizer(res) (from Optim.jl) or stderror(res), etc.

The main fit() function now returns an object of type GMMFit with fields like myfit.theta_hat (the estimated parameters). There are separate functions vcov_simple(...) and vcov_bboot(...); these create a GMMVcov object and store it in myfit.vcov. We now have functions (like read_fit()) to read these types of objects from file(s).

> Mathematical appendix
>
> I took both the basic grad sequence as well as the "econometrics concentration" and macro sequence at BU, and even then they didn't cover what CMD means vs. GMM. I think it would be worth it to include some documentation which re-hashes Hansen in your preferred notation. It's something I wouldn't mind contributing to, since it would probably help me learn more about GMM.
>
> I've included a pull request (https://github.com/Gkreindler/GMMTools.jl/pull/1) for documentation. So theoretically in the future there might be a place to put this stuff.

This is a wonderful idea; your contribution would be much appreciated, and I would be happy to discuss and sketch it out. I do not know how to maintain documentation pages, so I've put that on hold for now. But I've included bits and pieces in a few .md files that act as a tutorial (they are linked from the main readme file).

> I would use an OrderedDict rather than a DataFrame to store results. You should omit the DataFrames.jl dependency entirely and use PrettyTables.jl for printing.

I stuck with DataFrames because at some point it's necessary to sort a table by objective value, and I didn't know how to do that with an OrderedDict. I also wanted to easily read/write CSV. Are these possible with OrderedDict? In general, I am sure there are dependencies that are not strictly necessary.
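For the record, sorting by value is possible with an OrderedDict (from OrderedCollections.jl; whether this beats a DataFrame for CSV round-trips is a separate design call):

```julia
using OrderedCollections  # provides OrderedDict

# Toy example: objective values keyed by run id
results = OrderedDict("run3" => 0.7, "run1" => 0.2, "run2" => 0.5)

# Sort the key => value pairs by objective value and rebuild the OrderedDict
sorted = OrderedDict(sort(collect(results), by = last))

first(sorted)  # "run1" => 0.2 (lowest objective first)
```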

> Why does the moment function need to be a 1 x K matrix? I wouldn't naturally write m as a matrix like that. Similarly, why a matrix for the various initial theta guesses?

In GMM it's natural to have the moment function return an N x M matrix (N = observations, M = moments). We can think of CMD as the case N = 1, with the variance-covariance matrix handled separately. The initial theta guesses can now be either a vector (length P, the number of parameters) or a K x P matrix, where K is the number of initial conditions to use.
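As a toy illustration of this N x M convention (linear IV moments with invented data; not GMMTools' internals):

```julia
using Random, Statistics

Random.seed!(42)
N, P = 100, 2                       # observations, parameters
X = randn(N, P)                     # regressors
Z = X                               # instruments; exactly identified, so M = P
theta_true = [1.0, -2.0]
y = X * theta_true .+ 0.1 .* randn(N)

# Moment function in the N x M convention: one row per observation,
# one column per moment condition E[z_i * (y_i - x_i'theta)] = 0
moments(theta) = Z .* (y .- X * theta)

size(moments(theta_true))                        # (100, 2)
gbar = vec(mean(moments(theta_true), dims = 1))  # sample moments, ≈ 0 at the truth
```

Averaging the rows gives the M-vector of sample moments that the GMM objective quadratic form is built from.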