EpiModel / ARTnet

Network and Epidemic Parameterization with the ARTnet Study Data

Edit build_netparams to work with multiple combined jurisdictions #39

Closed sgoodreau closed 1 year ago

sgoodreau commented 2 years ago

@smjenness @AdrienLeGuillou your guidance needed here.

In issue #37 and pull request #38, I went as far as making sure the new county FIPS code functionality, and related multiple-merged-jurisdiction functionality, worked in build_epistats. What I didn't appreciate is that it would break later functions in the standard EpiModelHIV call flow, specifically build_netparams and probably build_netstats (I can't quite tell on the latter yet because I don't have good output from the former). Hopefully the ramifications don't go beyond that.

I believe the fix in build_netparams is relatively straightforward, with a caveat. Right now, the regressions off of which all of the parameters are calculated use d$geog as their predictor, which puts in a parameter for every separate value present in d$geog. As long as there are only a small number of units (as there are for NHBS cities, states, and census divisions/regions), this is not a problem. But for counties, many of which have only 1 respondent, this blows up. These all then get turned into predicted values for the observed geog.cat; however, geog.cat used to always be a vector of length 1 and is now a vector of arbitrary length, and the code here errors out.

The straightforward solution is to replace the use of d$geog as the predictor with d$geogYN. This adds a single binary predictor to the model: whether the respondent is in the included geog set or not. Because the model was fully saturated before, and only allowed one geog.cat in the predictor, once you convert to predicted values the results should be very similar for all previous models, while also elegantly allowing for the two forms of additional flexibility. (They won't be identical if there are correlations with other predictors.) Indeed, when I dove into the code I was surprised to see that d$geogYN wasn't used, since it seems like that's exactly what it would have been constructed for. Here is an example, taken from running build_netparams in debug mode, with added test code:

myepistats_ATL <- build_epistats(
              geog.lvl = "city",
              geog.cat = "Atlanta",
              init.hiv.prev = c(0.33, 0.137, 0.084),
              race = FALSE,
              time.unit = 7
            )

debug(build_netparams)

mynetparams_ATL <- build_netparams(
                    epistats = myepistats_ATL, 
                    smooth.main.dur = TRUE)

### Lines 113-117 of the function
mod <- glm(same.age.grp ~ geog + index.age.grp, data = lmain, 
      family = binomial())
dat <- data.frame(geog = geog.cat, index.age.grp = 1:age.grps)
pred <- predict(mod, newdata = dat, type = "response")
out$main$nm.age.grp <- as.numeric(pred)

### Check results
out$main$nm.age.grp         # [1] 0.7815998 0.6963146 0.5949834 0.4848550 0.3761780

### Revised version:
mod2 <- glm(same.age.grp ~ geogYN + index.age.grp, data = lmain, 
      family = binomial())
dat2 <- data.frame(geogYN = 1, index.age.grp = 1:age.grps)
pred2 <- predict(mod2, newdata = dat2, type = "response")
out$main$nm.age.grp2 <- as.numeric(pred2)
out$main$nm.age.grp2         # [1] 0.7802957 0.6957081 0.5954386 0.4865149 0.3788573

As you can see - similar, but not identical.

The first option, then, is to go through and replace all of the d$geog regressions with d$geogYN regressions, knowing that this will accommodate the new functionality, but that past results will remain similar rather than identical.

The less straightforward way is to add a flag: if either geog.lvl == 'county' or length(geog.cat) > 1, then use d$geogYN in the regression; otherwise use d$geog. This is less elegant, but preserves backward compatibility.
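For concreteness, the flag-based option might look something like the sketch below. The variable names (geog.lvl, geog.cat, lmain, age.grps) mirror the build_netparams internals shown above, but the toy data are invented here purely so the snippet runs on its own; this is a sketch of the idea, not the actual implementation.

```r
## Hypothetical sketch of option 2: choose the geography predictor
## conditionally. Toy data below are invented for illustration.
set.seed(1)
age.grps <- 5
lmain <- data.frame(
  same.age.grp  = rbinom(200, 1, 0.5),
  geog          = sample(c("Atlanta", "Other"), 200, replace = TRUE),
  index.age.grp = sample(1:age.grps, 200, replace = TRUE)
)
lmain$geogYN <- as.integer(lmain$geog == "Atlanta")
geog.lvl <- "county"
geog.cat <- "Atlanta"

if (geog.lvl == "county" || length(geog.cat) > 1) {
  ## binary predictor: avoids one coefficient per county
  mod <- glm(same.age.grp ~ geogYN + index.age.grp,
             data = lmain, family = binomial())
  dat <- data.frame(geogYN = 1, index.age.grp = 1:age.grps)
} else {
  ## legacy behavior: saturated geog factor, backward-compatible
  mod <- glm(same.age.grp ~ geog + index.age.grp,
             data = lmain, family = binomial())
  dat <- data.frame(geog = geog.cat, index.age.grp = 1:age.grps)
}
pred <- predict(mod, newdata = dat, type = "response")
```

The downside, as noted below, is that this conditional would have to be copy-pasted around every one of the many regressions in the function.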

A third option is to let the user specify which predictor to use, with d$geog as the default, but with the constraint that d$geogYN must be used under the two conditions above.

Alas, these latter two options will add a lot of code, since there are many regressions conducted. It's easy to write (a lot of copy-and-paste) but inelegant.

I'm happy to implement whichever option you all think is best.

If only my project involved NYC metro, not NYC proper :-( Indeed I probably could have just used the NYC metro behavioral data and be done with it, and nobody would be the wiser. But here I am.

AdrienLeGuillou commented 2 years ago

I think the first option is the cleanest in terms of code (good for long-term maintainability) and also in terms of model stability. Having a lot of modalities with 1-2 respondents could (even if it seems marginal here) produce unstable coefficients. The binary approach seems the best fit, as we only care about geogYN == 1 in the end.

sgoodreau commented 2 years ago

Thanks @AdrienLeGuillou. I agree, but needed to hear from you two that it's OK since it's your analyses for which perfect backwards compatibility will break. But the results are still so close, and old code doesn't produce an error, and this is what we have renv for. @smjenness are you OK with this too?

In terms of coding, the task could be made even simpler by assigning d$geog <- d$geogYN (and one other similar change) early in the function; then all of the rest of the code would work without needing to be switched. I'm not a fan of that, since the variable names would no longer match their content well, but it would reduce the room for coding errors. The full task is also pretty easy, though, so I'm just throwing this out there as a thought, but not a very serious one.

Last question for now: do you two know if d$geog and the like cease to be used after build_netparams and build_netstats, or will I keep bumping into new issues? I'll do a full code search on my own, but figure you may already have a sense. Hopefully this will be it.

AdrienLeGuillou commented 2 years ago

I think using geogYN is unfortunately safer than assigning it to geog

I honestly don't know for the other question; I think not, but this part of the code is not the one I am most familiar with.

smjenness commented 2 years ago

Hi @sgoodreau : thanks for delving into this. My suggestion/request:

Thank you thank you!

sgoodreau commented 2 years ago

@smjenness @AdrienLeGuillou

Circling back to this: I just tested and then committed the revisions to build_netparams. They work wonderfully. And...then I go to build_netstats, and discover I've hit a dead end, or at least a longer road than I was hoping to traverse. The only geography-related item here is to pull in the proportions of the population belonging to each of EpiModel's three racial/ethnic groups from a data structure that includes these values for every possible geography that EpiModel allows.

This is easy peasy for me to do for the one use case I actually care about (NYC). To do it for every county in the US isn't that hard, but probably time-consuming. To do it in a way that allows for multiple jurisdictions at any level to be combined means going back and adding in the pop sizes for all of the NHBS cities and all states, then editing the code to take weighted averages when there is more than one jurisdiction.

This is all feeling like too much work to generalize code that may or may not ever be used by anyone other than me, when my one use case would take me just a few more minutes. Although perhaps if you pointed me to the code/person that pulled and crunched the census numbers for the states, I'd see that it's not too difficult. Alas, this little diversion has slowed me down more than I initially expected :-(
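For what it's worth, the weighted-average step itself is simple; a minimal sketch, with invented column names and population numbers (this is not the actual ARTnetData structure):

```r
## Hypothetical sketch: combine per-jurisdiction race/ethnicity proportions
## into one distribution via a population-weighted average.
## Counties, pop sizes, and proportions below are illustrative only.
race_dist <- data.frame(
  county = c("Bronx", "Kings"),
  pop    = c(1400000, 2600000),   # e.g. males 15-65 (invented numbers)
  prop_B = c(0.30, 0.28),
  prop_H = c(0.48, 0.19),
  prop_W = c(0.22, 0.53)
)
## weight each jurisdiction by its share of the combined population
w <- race_dist$pop / sum(race_dist$pop)
combined <- colSums(w * race_dist[, c("prop_B", "prop_H", "prop_W")])
```

The hard part, as described above, is not this arithmetic but assembling the per-county population sizes and proportions in the first place.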

Thoughts?

AdrienLeGuillou commented 2 years ago

If I understand correctly, you have the proportion of each racial/ethnic group for all geographic zones. But you lack the population size of each zone, which is necessary to calculate such proportions when using an aggregate of multiple jurisdictions. Is that the issue?

sgoodreau commented 2 years ago

The lack of pop size is the smaller of the two issues; but yes, it is one of the two issues.

The bigger is that we have the racial/ethnic composition in there for US states, NHBS metro areas, and census regions. But to follow the logic through from my earlier expansion in options, we would now need to add it for US counties as well. There are over 3000 of these, some very small. As long as the original numbers were proportions among all people in the jurisdiction, or all males, or all males within a certain age range, we should be able to obtain these for every county. However, if they were derived from estimates of MSM (e.g. from the work of Gray et al.) then we might not. We would have them for the five very large counties I actually care about, though (the boroughs of NYC). If the data are available for all counties, it would still be a bit of data scraping and crunching to get them.

If I had info on the year, exact metric, and data source used to get these (the Census Bureau produces multiple types of values and estimates), and/or the code used to process those into the included numbers, that would likely bring the task down to just a few hours of work. So determining the state of that info is the key to deciding on the route forward.

sgoodreau commented 2 years ago

OK, I have traced the data back to the xlsx file found here: https://github.com/EpiModel/ARTnetData/blob/master/inst/RaceDistribution.xlsx

And a description of the population here (https://github.com/EpiModel/ARTnetData/pull/3), which makes clear that the proportions are among males aged 15-65, but not what year/source was used (American Community Survey 2019?).

Looks like @connor-vanmeter was responsible for the work. I know he's left the group, so I'm not sure we'll be able to get more info. @smjenness do you know? The formatting of the final tables in xlsx rather than csv, and their internal appearance, make me believe these were ultimately done with a fair amount of hand manipulation rather than pure reproducible code.

sgoodreau commented 2 years ago

Hi team - as time goes by it becomes clearer that I won't have the wherewithal to do the generalized version here.

In the meanwhile, I've gone and gotten the numbers for NYC, and also added an optional argument in build_netstats that allows one to pass the race distribution instead of taking it from ARTnetData. This gets me where I need to be for now, and I can move on. I could clean this up and document it, such that the earlier added functionality could still be added to main, with folks aware that if they're going to use it they need to figure out and supply the race/ethnicity distribution. Or we could just leave this branch unmerged forever. Happy to do either, or to discuss here or in real time to decide.
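To illustrate the shape of that optional argument, here is a minimal sketch; the actual argument name, validation, and fallback in the branch may differ, and the helper function and NYC proportions below are invented for illustration:

```r
## Hypothetical sketch of an optional user-supplied race distribution
## bypassing the ARTnetData lookup. Names and numbers are illustrative.
get_race_dist <- function(geog.cat, race.dist = NULL) {
  if (!is.null(race.dist)) {
    stopifnot(length(race.dist) == 3,           # EpiModel's three groups
              abs(sum(race.dist) - 1) < 1e-6)   # must be proportions
    return(race.dist)
  }
  ## otherwise fall back to the ARTnetData lookup (not reproduced here)
  stop("ARTnetData lookup not shown in this sketch")
}

## e.g., manually obtained proportions (invented numbers, not real NYC data):
nyc_dist <- get_race_dist("NYC", race.dist = c(0.26, 0.29, 0.45))
```

The advantage is that the multi-jurisdiction functionality can merge without anyone having to build out the full county-level table first.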

smjenness commented 2 years ago

It would be nice if you could implement the version that allows us to pass in an arbitrarily defined race distribution, if I understand correctly and it's not too much work. Thanks!

sgoodreau commented 2 years ago

OK, can do. Note that it still helps to know the answers to my previous questions above. Obviously the more info you have, the better, but it's also OK if the answers are all "No, I don't have any more info anywhere on what Conor did" - just knowing that will help me move forward.

smjenness commented 2 years ago

The answers are all: No, I don't have any more info anywhere on what Conor did. :)