More open-ended issue; final approach not nailed down yet.
Overall:
Common prefix for package functions? birdie_ is too long, but maybe brd_ or bd_ or something? Leaning against, given the small number of exported functions.
BISG side:
One unified predict_race() function, with a flag for the measurement-error model? Or a separate _me variant? If the latter, try to consolidate the checking / table-building code.
Function name? Currently predict_race_sgz(), which is somewhat verbose (and Z is now X in the paper). wru uses predict_race(). Could call it bisg(), bisg_race(), etc.
function interface:
Currently have arguments S=, G=, etc. These could be more verbose (pluses and minuses to that). Problem now is that Z= is specified as tidy-select, which is not the most intuitive (but Z is also unlikely to be used much). Alternatively could do a formula interface like ~ name(last_name) + zip + age, where the surname is specially marked. This has the advantage of flexibility but requires more typing in the name() case.
Tables p_rs and p_rgz are now provided as data frames to be linked, with column names matching the formula. This seems like a good idea, and the short names here argue for the S=, G=, etc. naming. Probably best to expose the census-table-making to the user (if possible), and use those tables as defaults here rather than hiding them in the guts of the function. This would also let users see what the table needs to look like. Problem is that building these tables requires inputs computed from the model formula, etc.
Allow for no geographies to be used. Functionally this means creating a dummy 1-level geography.
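A rough sketch of how the formula interface and the no-geography fallback could fit together, using base R's terms() with specials=. Everything here is hypothetical: parse_bisg_formula(), the .geo column name, and the toy data are illustrations only, not current package code.

```r
# Hypothetical helper: split a one-sided BISG formula into the surname
# column and the remaining (geography / covariate) columns.
# `name()` acts only as a marker; terms() never evaluates it.
parse_bisg_formula <- function(formula, data) {
    tt <- terms(formula, specials = "name")
    vars <- as.list(attr(tt, "variables"))[-1]  # drop the leading `list`
    idx <- attr(tt, "specials")$name            # NULL if no name() term

    surname <- NULL
    if (!is.null(idx)) {
        surname <- as.character(vars[[idx]][[2]])  # symbol inside name(...)
        vars <- vars[-idx]
    }
    geo <- vapply(vars, deparse, character(1))

    # No geography given: fall back to a dummy 1-level geography
    if (length(geo) == 0) {
        data$.geo <- factor(rep("all", nrow(data)))
        geo <- ".geo"
    }
    list(surname = surname, geo = geo, data = data)
}

# Toy data, for illustration only
d <- data.frame(last_name = c("SMITH", "GARCIA"),
                zip = c("02138", "98101"),
                age = c(30, 61))
spec <- parse_bisg_formula(~ name(last_name) + zip + age, d)
spec_nogeo <- parse_bisg_formula(~ name(last_name), d)
```

With this approach the surname is identified purely by the name() wrapper, so the remaining terms can be matched against the column names of the p_* tables.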
Goal is not to recreate all the functionality of wru::predict_race(), which has first + middle names, other decades, auto age+sex integration, etc., but rather to make the most common use case (BISG with just surnames or zip codes) nice and easy, with a tidy & user-friendly interface. Anything else is also possible with custom p_* tables.
Census data functions: these basically take the S or G vector and make the table. Right now they need a *_name param to name the column correctly. Could make the function signature like census_surname_table(..., p_r, counts=FALSE, flip=FALSE), with ... taking a single vector. The name of the vector is used as the output column name unless the vector is passed as a named argument. So census_surname_table(last_name, p_r) or census_surname_table(surname=last_name, p_r).
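A minimal sketch of the column-naming mechanics only, assuming base R's match.call(). census_table_sketch() is a hypothetical stand-in: it does no census lookup and just fills each row with the marginal p_r as a placeholder. The counts= and flip= arguments are omitted. One caveat that is certain: in R, arguments that come after ... are matched only by exact name, so p_r has to be passed as p_r = ... with this signature.

```r
# Hypothetical stand-in for census_surname_table(); no census data involved.
census_table_sketch <- function(..., p_r) {
    dots <- match.call(expand.dots = FALSE)$...
    if (length(dots) != 1L) stop("Pass exactly one vector in `...`.")

    # Use the argument name if given, else the expression that was passed
    col <- names(dots)
    if (is.null(col) || !nzchar(col)) col <- deparse(dots[[1]])

    keys <- unique(list(...)[[1]])
    out <- data.frame(keys, stringsAsFactors = FALSE)
    names(out) <- col

    # Placeholder: every key gets the marginal distribution p_r
    for (r in names(p_r)) out[[paste0("pr_", r)]] <- p_r[[r]]
    out
}

# Toy inputs; the probabilities are made up, not census figures
last_name <- c("SMITH", "GARCIA", "SMITH")
t1 <- census_table_sketch(last_name, p_r = c(a = 0.6, b = 0.4))
t2 <- census_table_sketch(surname = last_name, p_r = c(a = 0.6, b = 0.4))
```

Here t1 gets a last_name column and t2 a surname column, which is the behavior described above.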
Model side
Name: model_race() currently; about as vague as it gets. Probably should call this birdie()
Dream: function like birdie(Y ~ (1 | zip) + (1 | state) + zip_popdens, data=d, ..., method="mle", ...). I.e. a custom model formula with random effects, estimated with EM. Other args control computation, rather than e.g. the separate _hmc variant we have now. Still returns a custom birdie S3 class with generics.
Generic support (overall goal: make this analogous to standard model functions, so people better separate BISG from the modeling step and appreciate the role of modeling in this process; not a purely plug-and-play magic box):
print()
summary() with more model info
predict() or fitted() to generate updated BISG probabilities pr(R | Y, X, G, S), which could in theory (TRIPLE CHECK THIS) be used within a weighting estimator later, without any bias
simulate(), analogously, to draw R | Y, X, G, S from the predictive distribution for multiple imputation
as_draws_*() & other posterior:: / rvar generics for the Bayesians
plot() for some kind of model diagnostics
ranef() perhaps for table of random effects for Y|R. depends on implementation of birdie()
resid() and other generics: maybe pass through to the underlying GLMM, & use final-round EM estimates or something
Is there a natural generic to pull out the joint distribution? Or just provide a separate function for this. calc_joint_model() is kind of annoying to use.
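To make the generics list concrete, here is a toy S3 sketch with print(), fitted() and simulate() methods. It is not the package implementation: the class name birdie_sketch, the constructor, and all numbers are made up, and the update assumes the simplest case, where Y is conditionally independent of the surname given race, so that pr(R | Y, G, S) is proportional to pr(Y | R) pr(R | G, S).

```r
# p_r:  n x K matrix of BISG probabilities pr(R | G, S)
# p_yr: K x J matrix of pr(Y = y | R = r); rows sum to 1
# y:    outcome for each observation, as a column index into p_yr
new_birdie_sketch <- function(p_r, p_yr, y) {
    structure(list(p_r = p_r, p_yr = p_yr, y = y), class = "birdie_sketch")
}

print.birdie_sketch <- function(x, ...) {
    cat("birdie sketch:", nrow(x$p_r), "obs.,", ncol(x$p_r), "race categories\n")
    invisible(x)
}

# Updated probabilities via Bayes' rule, row by row
fitted.birdie_sketch <- function(object, ...) {
    lik <- t(object$p_yr[, object$y, drop = FALSE])  # n x K: pr(Y = y_i | R = r)
    post <- object$p_r * lik
    post / rowSums(post)
}

# Draw race categories from the updated probabilities (n x nsim matrix)
simulate.birdie_sketch <- function(object, nsim = 1, seed = NULL, ...) {
    if (!is.null(seed)) set.seed(seed)
    post <- fitted(object)
    draws <- replicate(nsim, apply(post, 1, function(p) {
        sample.int(length(p), 1, prob = p)
    }))
    matrix(draws, nrow = nrow(post))
}

# Toy inputs, for illustration only
p_r <- rbind(c(0.5, 0.5), c(0.9, 0.1), c(0.2, 0.8))
p_yr <- rbind(c(0.8, 0.2), c(0.4, 0.6))
fit <- new_birdie_sketch(p_r, p_yr, y = c(1L, 2L, 1L))
post <- fitted(fit)
imp <- simulate(fit, nsim = 5, seed = 1)
```

Each column of imp is one imputed race vector, which is the shape a multiple-imputation workflow would need.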
Think about how an average user would use the fitted conditional probabilities. Make it easy to generate a nice figure (autoplot() or something, with ggplot2 in Suggests:?). Add a helper function to compute all pairwise disparities in long (or also wide) format, along with uncertainty quantification? So something like birdie(Y ~ (1 | zip), data=d) |> disparities(format="long") |> knitr::kable() for instant R Markdown output. Look into packages that produce nice output, and whether there's a way to store metadata / implement generics to make this really nice.
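One possible shape for the disparities helper, as a sketch under stated assumptions: disparities_sketch() is hypothetical, takes a named vector of group-level estimates (rather than a fitted birdie object), and, if standard errors are supplied, combines them as if the group estimates were independent. A real version would need the joint uncertainty from the model.

```r
# est: named vector of estimates by group, e.g. pr(Y = 1 | R = r)
# se:  optional named vector of standard errors (treated as independent)
disparities_sketch <- function(est, se = NULL, format = c("long", "wide")) {
    format <- match.arg(format)
    if (format == "wide") {
        return(outer(est, est, "-"))  # matrix of row group minus column group
    }
    pairs <- t(combn(names(est), 2))  # one row per unordered pair
    out <- data.frame(
        group_1 = pairs[, 1],
        group_2 = pairs[, 2],
        disparity = unname(est[pairs[, 1]] - est[pairs[, 2]]),
        stringsAsFactors = FALSE
    )
    if (!is.null(se)) {
        out$se <- unname(sqrt(se[pairs[, 1]]^2 + se[pairs[, 2]]^2))
    }
    out
}

# Toy estimates, for illustration only
est <- c(a = 0.30, b = 0.45, c = 0.25)
disp_long <- disparities_sketch(est)
disp_wide <- disparities_sketch(est, format = "wide")
```

The long output is a plain data frame, so it can be passed straight to knitr::kable() or similar table functions.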
Consider storing some tiny metadata in the BISG output about what columns were used. Then can throw warnings if the columns used in birdie() suggest a violation of model assumptions (e.g. reusing surname, omitting some covariate, different geo levels, etc).
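A small sketch of how such metadata and one of the checks could work, using a plain attribute. tag_bisg(), check_bisg_reuse(), and the bisg_meta attribute name are all hypothetical, and only the surname-reuse check is shown.

```r
# Attach the columns used for BISG to the probability matrix
tag_bisg <- function(p_r, surname, geo) {
    attr(p_r, "bisg_meta") <- list(surname = surname, geo = geo)
    p_r
}

# Warn if the model formula reuses the surname column; returns TRUE if OK
check_bisg_reuse <- function(formula, p_r) {
    meta <- attr(p_r, "bisg_meta")
    if (is.null(meta)) return(invisible(TRUE))
    used <- all.vars(formula[[length(formula)]])  # right-hand-side variables
    reused <- meta$surname %in% used
    if (reused) {
        warning("Surname column `", meta$surname,
                "` was used for BISG and also appears in the model formula.")
    }
    invisible(!reused)
}

# Toy probabilities, for illustration only
p <- tag_bisg(matrix(0.5, 2, 2), surname = "last_name", geo = "zip")
ok <- check_bisg_reuse(Y ~ (1 | zip), p)  # no warning: surname not reused
```

Calling check_bisg_reuse(Y ~ last_name + (1 | zip), p) would raise the warning, since last_name was recorded as the BISG surname column.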
Make the weighted estimator as easy to use as birdie(). Then in the documentation stress that the choice of method depends on the causal structure. Maybe create an assumptions() function that lists the conditional independencies assumed by the fitted model, in terms of the relevant variable names.
Hide the threshold and p-OLS estimators somewhere. Maybe put them in replication-specific code only. We don't want people using these.
Need to make decisions and then turn these into TODO issues.