Rename and reorganize functions

CoryMcCartan commented 1 year ago

More open-ended issue; final approach not nailed down yet.

Overall: Common prefix for package functions? birdie_ is too long, but maybe brd_ or bd_ or something? Leaning against, given the small number of exported functions.

BISG side:

One unified predict_race() function, with a flag for the measurement-error model? Or a separate _me variant? If the latter, try to consolidate the checking / table-building code.
Function name? Currently predict_race_sgz(), which is somewhat verbose (also, Z is now X in the paper). wru uses predict_race(). Could call it bisg(), bisg_race(), etc.
Function interface:
- Currently have arguments S=, G=, etc. These could be more verbose (pluses and minuses to that). The problem now is that Z= is specified as tidy-select, which is not the most intuitive (but Z is also unlikely to be used much). Alternatively, could do a formula interface like ~ name(last_name) + zip + age, where the surname is specially marked. This has the advantage of flexibility but requires more typing for the name() case.
- Tables p_rs and p_rgz are now provided as data frames to be linked, with column names matching the formula. This seems like a good idea, and the short names here argue for the S= G= etc. naming. Probably best to expose the census-table-making to the user (if possible), and use those tables as defaults here rather than hiding them in the guts of the function. Would also let users see what the table needs to look like. Problem is that these tables need things computed from the model formula, etc.
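One way the formula interface could pick out the specially marked surname is base R's `terms()` with its `specials` argument. This is only a sketch under assumptions: `parse_bisg_formula()` is a hypothetical helper, not an existing package function, and it assumes the surname is wrapped in `name()` as in the example formula above.

```r
# Hypothetical helper: split a one-sided BISG formula into the surname
# column (wrapped in name()) and the remaining covariate columns.
parse_bisg_formula <- function(formula) {
  tt <- terms(formula, specials = "name")
  vars <- as.list(attr(tt, "variables"))[-1]  # drop the leading `list` symbol
  idx <- attr(tt, "specials")$name            # position of the name(...) term
  if (length(idx) > 1L) stop("Only one name() term is supported in this sketch.")
  surname <- if (is.null(idx)) NULL else all.vars(vars[[idx]])
  others <- if (is.null(idx)) vars else vars[-idx]
  covariates <- unlist(lapply(others, all.vars))
  list(surname = surname, covariates = covariates)
}

parsed <- parse_bisg_formula(~ name(last_name) + zip + age)
parsed$surname     # "last_name"
parsed$covariates  # "zip" "age"
```

The returned names could then be matched against the column names of the p_* tables.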
Allow for no geographies to be used. Functionally this means creating a dummy 1-level geography.
Goal is not to recreate all the functionality of wru::predict_race(), which has first + middle names, other decades, auto age+sex integration, etc. Rather, the goal is to make the most common use case (BISG with just surnames or zip codes) nice and easy, with a tidy & user-friendly interface. Anything else is also possible with custom p_* tables.
Census data functions: these basically take the S or G vector and make the table. Right now they need a *_name param to name the column correctly. Could make the func signature like census_surname_table(..., p_r, counts=FALSE, flip=FALSE), with ... taking a single vector. The output column is named after the vector as written in the call, unless the vector is passed as a named argument, in which case that name is used. So census_surname_table(last_name, p_r) or census_surname_table(surname=last_name, p_r).
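The naming rule for `...` can be sketched with `match.call()`. This is an illustration only: `name_table_sketch()` is a hypothetical stand-in that just returns the unique values under the chosen column name, and it leaves out `p_r`, `counts`, `flip`, and any census data.

```r
# Hypothetical sketch of the `...` naming rule only; a real function would
# also take p_r, counts, and flip, and join in the census data.
name_table_sketch <- function(...) {
  dots <- match.call(expand.dots = FALSE)$`...`
  if (length(dots) != 1L) stop("Pass exactly one vector in `...`.")
  nm <- names(dots)
  # Use the argument name if one was given, else the expression as written.
  col <- if (!is.null(nm) && nzchar(nm)) nm else deparse(dots[[1L]])
  out <- data.frame(unique(..1), stringsAsFactors = FALSE)
  names(out) <- col
  out
}

last_name <- c("SMITH", "LEE", "SMITH")
names(name_table_sketch(last_name))            # "last_name"
names(name_table_sketch(surname = last_name))  # "surname"
```

One thing to keep in mind for the proposed signature: in base R, arguments declared after `...` are matched only by name, so `p_r` would have to be passed as `p_r = ...` rather than positionally.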

Model side

Name: model_race() currently; about as vague as it gets. Probably should call this birdie().
Dream: function like birdie(Y ~ (1 | zip) + (1 | state) + zip_popdens, data=d, ..., method="mle", ...). I.e. custom model formula with random effects, estimated with EM. Other args to control computation, rather than e.g. a separate _hmc variant like we have now. Still returns a custom birdie S3 class with generics.
Generic support (overall goal: make this analogous to standard model functions, so people better separate BISG from the modeling step and appreciate the role of modeling in this process; not a purely plug-and-play magic box):
- print()
- summary() with more model info
- predict() or fitted() to generate updated BISG probabilities Pr(R | Y, X, G, S), which could in theory (TRIPLE CHECK THIS) be used within a weighting estimator later, without any bias
- simulate() analogously to draw predictive R | Y, X, G, S for multiple imputation
- as_draws_*() & other posterior:: / rvar generics for the Bayesians
- plot() for some kind of model diagnostics
- ranef() perhaps for table of random effects for Y|R. depends on implementation of birdie()
- resid() and other generics: maybe pass through to the underlying GLMM, & use final-round EM estimates or something
- is there a natural generic to pull out the joint distribution? or just provide a separate function for this. calc_joint_model() is kind of annoying to use
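A minimal sketch of how a few of these generics could hang off an S3 class, using only base R. Everything here is hypothetical: the class name `birdie_sketch`, the constructor, and the stored fields are illustration, not the package's design. The fitted() method assumes that Y is conditionally independent of S given R, so that Pr(R | Y, X, G, S) is proportional to Pr(Y | R) Pr(R | X, G, S); for simplicity Pr(Y | R) does not vary with X or G here.

```r
# Hypothetical constructor: `p_bisg` is an n x K matrix of BISG probabilities
# Pr(R | X, G, S); `p_y_r` is a matrix of Pr(Y = y | R = r) with outcome
# levels as row names; `y` is the observed outcome for each individual.
new_birdie_sketch <- function(p_bisg, p_y_r, y) {
  structure(list(p_bisg = p_bisg, p_y_r = p_y_r, y = as.character(y)),
            class = "birdie_sketch")
}

print.birdie_sketch <- function(x, ...) {
  cat("BIRDiE-style model sketch:", nrow(x$p_bisg), "observations,",
      ncol(x$p_bisg), "race categories\n")
  invisible(x)
}

# Updated probabilities Pr(R | Y, X, G, S), proportional to
# Pr(Y | R) * Pr(R | X, G, S) under the assumed conditional independence.
fitted.birdie_sketch <- function(object, ...) {
  post <- object$p_bisg * object$p_y_r[object$y, , drop = FALSE]
  post / rowSums(post)
}

# Draw race labels from the updated probabilities, one column per imputation.
simulate.birdie_sketch <- function(object, nsim = 1, seed = NULL, ...) {
  if (!is.null(seed)) set.seed(seed)
  p <- fitted(object)
  draw_once <- function(i) {
    apply(p, 1, function(pr) sample(colnames(p), 1, prob = pr))
  }
  out <- matrix(unlist(lapply(seq_len(nsim), draw_once)), nrow = nrow(p))
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Toy inputs, made up purely for illustration.
p_bisg <- matrix(c(0.8, 0.2,
                   0.3, 0.7), nrow = 2, byrow = TRUE,
                 dimnames = list(NULL, c("white", "black")))
p_y_r <- matrix(c(0.6, 0.4,
                  0.4, 0.6), nrow = 2, byrow = TRUE,
                dimnames = list(c("no", "yes"), c("white", "black")))
fit <- new_birdie_sketch(p_bisg, p_y_r, y = c("yes", "no"))
print(fit)
fitted(fit)
simulate(fit, nsim = 3, seed = 1)
```

Because `fitted()` and `simulate()` are existing generics in stats, methods like these would slot into the usual model workflow without new verbs for users to learn.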
Think about how an average user would use the fitted conditional probabilities. Make it easy to generate a nice figure (autoplot() or something? with ggplot2 in Suggests:?). Make a helper function to compute all pairwise disparities in long (or also wide) format, along with uncertainty quantification? So like birdie(Y ~ (1 | zip), data=d) |> disparities(format="long") |> knitr::kable() for instant R Markdown output. Look into packages that do nice output & whether there's a way to store metadata / implement generics to make this really nice.
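A rough sketch of what the long-format disparity helper could compute, using base R's `combn()`. `disparities_sketch()` is a hypothetical name, its input is assumed to be a named vector of group-level estimates such as Pr(Y = 1 | R = r), and no uncertainty quantification is attempted.

```r
# Hypothetical helper: all pairwise differences between group-level
# estimates, one row per pair (long format).
disparities_sketch <- function(est) {
  pairs <- t(combn(names(est), 2))
  data.frame(
    group_1 = pairs[, 1],
    group_2 = pairs[, 2],
    disparity = unname(est[pairs[, 1]] - est[pairs[, 2]]),
    stringsAsFactors = FALSE
  )
}

# Toy estimates, made up purely for illustration.
est <- c(white = 0.50, black = 0.25, hisp = 0.125)
disp <- disparities_sketch(est)
disp
```

The result is a plain data frame, so it could be piped straight into knitr::kable().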
Consider storing some tiny metadata in the BISG output about what columns were used. Then can throw warnings if the columns used in birdie() suggest a violation of model assumptions (e.g. reusing surname, omitting some covariate, different geo levels, etc).
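The metadata idea could be as light as an attribute on the BISG output. The sketch below is hypothetical throughout (function names, the `bisg_cols` attribute, and the single check are all assumptions) and only covers the surname-reuse case.

```r
# Hypothetical sketch: tag BISG output with the columns used to build it.
tag_bisg_sketch <- function(probs, surname, geo = NULL) {
  attr(probs, "bisg_cols") <- list(surname = surname, geo = geo)
  probs
}

# Warn if the model formula reuses the surname column that went into BISG.
check_surname_reuse_sketch <- function(probs, model_formula) {
  surname <- attr(probs, "bisg_cols")$surname
  reused <- intersect(surname, all.vars(model_formula))
  if (length(reused) > 0L) {
    warning("Surname column reused in the model formula: ",
            paste(reused, collapse = ", "))
  }
  invisible(reused)
}

# Toy probabilities, made up purely for illustration.
probs <- tag_bisg_sketch(data.frame(pr_white = 0.7, pr_black = 0.3),
                         surname = "last_name", geo = "zip")
reused <- suppressWarnings(check_surname_reuse_sketch(probs, Y ~ last_name + state))
reused  # "last_name"
```

Checks for omitted covariates or mismatched geography levels could follow the same pattern.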
Make weighted estimator as easy to use as birdie(). Then in the documentation stress that the choice of method depends on causal structure. Maybe create an assumptions() function that lists the conditional independencies assumed by the fitted model, in terms of the relevant variable names.
Hide threshold and p-OLS estimator somewhere. Maybe put them in replication-specific code only. We don't want people using these.

Need to make decisions and then turn these into TODO issues.

CoryMcCartan commented 1 year ago

Census data function implemented in 57b60bf

CoryMcCartan commented 1 year ago

Basic generics implemented in 6cafb88

CoryMcCartan commented 1 year ago

Closing in favor of #9, #10, and #11, now that 95% of this list is done

CoryMcCartan / birdie

Rename and reorganize functions #4