EmilyPo / Diss-Duration


Edges-Only Troubleshooting #1

Open EmilyPo opened 4 years ago

EmilyPo commented 4 years ago

html file attached with example.

Fitting an edges-only model to my NSFG egodata usually results in a DRAMATIC drop in the edge count while running netdx. The only model that doesn't do this is the one where I set the dissolution coefs' exit rate to 0. In that scenario the edge count remains relatively steady (but the edge duration is about 10% off).
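For reference, the shape of the workflow I'm running (a minimal sketch; the duration, exit rate, and object names are placeholders, not my actual NSFG values):

```r
library(EpiModel)

# hypothetical exit rate and mean duration, for illustration only
exit.rate <- 0.0005
diss <- dissolution_coefs(dissolution = ~offset(edges),
                          duration = 60, d.rate = exit.rate)

# "est" stands in for the netest-compatible object built from the
# ergm.ego fit; netdx then runs the dynamic diagnostics
# dx <- netdx(est, nsims = 5, nsteps = 500, dynamic = TRUE)
# plot(dx, type = "duration")
```

Setting d.rate to 0 in dissolution_coefs is the variant where the edge count holds steady but the duration comes out ~10% off.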

@martinamorris @sgoodreau

martinamorris commented 4 years ago

you have some births, right?

martinamorris commented 4 years ago

ppopsize:

This warning is useful advice:

Using a smaller pseudopopulation size than sample size usually does not make sense.

You've got ~44K obs, and you're scaling down to 5K. That could cause all sorts of problems with cell counts going to 0.

Your estimation should be reasonably quick even with 45K. Have you tried?

But the other problem I'm seeing is this:

version 1

Note: Constructed network has size 996, different from requested 5000. Estimation should not be meaningfully affected.

version 2

Note: Constructed network has size 996, different from requested 5000. Estimation should not be meaningfully affected.

other network, version 1

Note: Constructed network has size 5006, different from requested 10000. Estimation should not be meaningfully affected.

I've never seen discrepancies this large, so I think something's probably wrong. Not sure what, will need to ask Chad. You might want to give him access to this repo in case we want to tag him with a question. (Update: see https://github.com/EmilyPo/Diss-Duration/issues/1#issuecomment-578319982)

martinamorris commented 4 years ago

For every model, you should run a summary(nw ~ terms) first! ALWAYS.

That way you can see if there are target stats with 0 or very small values. For the best diagnostics, you want to set the levels argument to levels=TRUE, which will keep all levels and let you see what your intended reference category count looks like.
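A minimal sketch of that check (the egodata object and attribute name here are placeholders for your own):

```r
# hypothetical egodata object "egodat" with a categorical "race" attribute;
# levels = TRUE keeps every category, including the intended reference
# level, so target stats that are 0 or very small are visible
summary(egodat ~ edges + nodefactor("race", levels = TRUE))
```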

sgoodreau commented 4 years ago

Yes there are births and deaths and aging out.

Which makes me suddenly realize that I don't remember how netdx(dynamic=TRUE) handles the births and deaths that occur in the full simulation model. The formation model has been fit conditional on the death rate that is passed in dissolution_coefs, meaning it should work correctly in a full dynamic simulation with that departure rate, but not otherwise. But then netdx(dynamic=FALSE) doesn't have access to those full vital dynamics. Does it revert to the crude dissolution rate that is still stored in the dissolution_coefs object? @martinamorris do you know off the top of your head? If not, we can look in the code or the NME materials, or ask Sam.

EmilyPo commented 4 years ago

Ok let me see if I can parse some of this / respond.

  1. I have births in the simulation, but I don't know how to incorporate that into the netdx (this has actually always been a source of confusion for me). Perhaps I am misunderstanding the whole ergm.ego-netest-netdx workflow.

  2. The first ergm.ego object estimated (eom) has a population size equal to the respondents in nsfg (~42k). The reason the other objects are smaller is that I used Pavel's method for determining appropriate network size based on survey weights in the data and determined that it needed to be around 5,000 nodes (because all the survey weights are huge). This felt weird to me, but the three of us discussed this when I was originally fitting models a few months ago and I was told this was ok. I also used population sizes of 5,000, 10,000, and 20,000 to compare fits and it didn't make a huge difference, but we decided that 10,000 seemed like a reasonable size. Martina, you also told me that if I put in, say, 10,000 as the ppop.size and the actual network ended up being lower than that, it was nothing to worry about. Do I need to correct any of this?

  3. The call to "eom" in my code was supposed to print out the target stats; not sure why rMarkdown chose not to print it. Does the summary(nw ~ terms) call work here? I tried it on my ergm.ego object and on the netest object I generated from JKB's function, but it's not an object with an edgelist, so I'm getting an error. In this case though, for the 42k-size model the target is ~9000 for edges, so it's not an issue of small cell sizes.

martinamorris commented 4 years ago

1. I have births in the simulation, but I don't know how to incorporate that into the netdx (this has actually always been a source of confusion for me). Perhaps I am misunderstanding the whole ergm.ego-netest-netdx workflow.

ok, i'll see if i can pull some template scripts for you. (update: take a look at this script https://github.com/statnet/WHAMP/blob/master/adams_egodx_darc/eeDiag_darc.Rmd)

2. The first ergm.ego object estimated (eom) has a population size equal to the respondents in nsfg (~42k). The reason the other objects are smaller is that I used Pavel's method for determining appropriate network size based on survey weights in the data and determined that it needed to be around 5,000 nodes (because all the survey weights are huge). This felt weird to me, but the three of us discussed this when I was originally fitting models a few months ago and I was told this was ok. I also used population sizes of 5,000, 10,000, and 20,000 to compare fits and it didn't make a huge difference, but we decided that 10,000 seemed like a reasonable size. Martina, you also told me that if I put in, say, 10,000 as the ppop.size and the actual network ended up being lower than that, it was nothing to worry about. Do I need to correct any of this?

aha, that's a misunderstanding i think. i'm sure there's standard nomenclature for the many different types of "weights" that can be used in surveys, but i'm not sure what that is (shameful, i know).

the weight that you're using is one that scales the sample up to the population (with appropriate strat and post-strat adjustments). let me call this the "pop weight".

another version of the weight rescales the observations to the same sample size, with the appropriate strat and post-strat adjustments. let me call this the "sample weight".

  1. i always work with the sample weight (which is just the pop weight * samp size/pop size). that just makes more sense to me. and i'm pretty sure that's what ergm.ego will expect in the weight vector.

  2. you never want to scale down your sample size (except in very unusual circumstances).

  3. to get the recommended ppopsize, solve for: min samp wt*ppopsize = 3.
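In base R, the conversion and the resulting ppopsize recommendation look like this (toy weights, purely illustrative):

```r
pop.wt <- c(1200, 3500, 800, 15000)   # toy population weights
n <- length(pop.wt)                   # sample size

# sample weight = pop weight * (sample size / population size),
# which rescales the weights to average 1 over the sample
samp.wt <- pop.wt * n / sum(pop.wt)

# recommended ppopsize: solve min(samp wt) * ppopsize = 3
ppopsize <- 3 / min(samp.wt)
```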

  1. the call to "eom" in my code was supposed to print out the target stats, not sure why rMarkdown chose not to print that out. Does summary(nw~terms) call work here? I tried on my ergm.ego object and the netest object I generated from JKB's function, but it's not an object with an edgelist so I'm getting an error. In this case though, for the 42k size model the target is ~9000 for edges so it's not an issue of small cell sizes.

i always use the summary call. and yes, it does work in ergm.ego, so you may be doing something wrong. you don't use it on the netest object, you use it on the egodata object.

martinamorris commented 4 years ago

Also, it's possible the discrepancy between ppopsize desired and achieved is due to the weight treatment:

control.ergm.ego(ppopsize = c("auto", "samp", "pop"), ppopsize.mul = 1, ppop.wt = c("round", "sample"), stats.wt = c("data", "ppop"), stats.est = c("asymptotic", "bootstrap", "jackknife", "naive"), boot.R = 10000, ergm.control = control.ergm(), ...)

JKB uses ppop.wt = "sample" so you can try this and see if it works.
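Concretely, that would look something like this (the egodata object and the edges-only formula are placeholders):

```r
# hypothetical edges-only fit using the "sample" weight treatment
# when constructing the pseudopopulation
fit <- ergm.ego(egodat ~ edges,
                control = control.ergm.ego(ppopsize = 10000,
                                           ppop.wt = "sample"))
```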

EmilyPo commented 4 years ago

Ok - trying to deal with the pop weight vs. sampling weight at the moment. Your explanation makes sense, but I am a bit confused when I convert the weights and then use the new weights to determine an appropriate pseudo-population size.

(in the example below, "nsfg" is the raw data I import from SPSS, and "nsfg_complete" is after some variable re-naming and the weights adjustment.)

[screenshot: Screen Shot 2020-01-27 at 2.50.42 PM]

This would suggest my pseudopopulation size should either be roughly 28 million...or less than 1. This cannot be correct....

martinamorris commented 4 years ago
  1. For my equation, you want 3/minwt, not 3*minwt. But mine deviates from the ergm.ego tutorial, and I've convinced myself with a small example that the ergm.ego tutorial is correct.

  2. Even using the q.25 wt, and the 1 observation target, you'd still get 147K

  3. So, I think we need @krivit to weigh in here (so you'll need to give him access).

  4. If we are going to stray from the rule here, it'd be good to know who those low weights belong to, as we'll likely not have those folks in our network

EmilyPo commented 4 years ago

@krivit

To catch you up, I'm trying to use data from the National Survey of Family Growth (USA) with ergm.ego to estimate a network and eventually use it to simulate in EpiModel.

The dataset combines several waves of the survey into one file with roughly 43k respondents. The survey weights are complex and include multiple factors but ultimately weight each respondent up to the national population size by age, sex, and race. I adjusted the weights so that they reflect the within-sample weight (population weight * (sampleSize / sumOfPopulationWeights)).

Going by the rule for determining appropriate pseudo-population size outlined in the ergm.ego tutorial (3 * (sampleSize / minimumWeight)), I determined that my network should be... size 28 million. As Martina mentions above, even using the q25 weight threshold and a multiplier of 1 instead of 3 in this equation leads me to a network of size 147k.

We wanted to bring you in to discuss when it's reasonable to stray from the guidelines for pseudopopulation size, and maybe to get some additional guidance on what we should do here. When others on our team use the target stats approach to develop their networks, they often have more complex models (with many race categories, many sexual groups, etc.) and use smaller networks, so it seems like there should be some solution... I hope.

@martinamorris I'm working up a description of those low weight inds so we know who they are.

krivit commented 4 years ago

To clarify, are you using ergm.ego the package, or are you computing target statistics and such manually?

EmilyPo commented 4 years ago

I am using ergm.ego.

krivit commented 4 years ago

OK... One saving grace is that actors are interchangeable if they have the same attributes that matter to the model. So, what you can try is:

  1. Figure out which nodal attributes matter to the model, and drop the rest. In particular, for quantitative attributes that you code as ordinal categories for ranges, convert them to categories and drop the quantitative variables.
  2. For each distinct combination of relevant attributes, sum up the sampling weights of the data rows with that combination. Hopefully, the smallest weight won't be too small.
  3. Construct a pseudopopulation in the form of a data frame where each row is repeated in proportion to its total sampling weight.
  4. Pass this data frame to estimation code as control.ergm.ego(ppopsize=data). This will force ergm.ego() to use data to construct the pseudopopulation network.
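A base-R sketch of those four steps (the attribute names and the layout of the "egos" frame are assumptions; adapt to your own data):

```r
# egos: one row per respondent, with a sample weight column and the
# nodal attributes; keep only attributes the model uses (step 1)
cells <- aggregate(weight ~ agecat + race + sex, data = egos, FUN = sum)

# repeat each attribute combination in proportion to its summed
# sampling weight (steps 2-3)
target.size <- 10000   # desired pseudopopulation size
cells$n.rep <- round(cells$weight / sum(cells$weight) * target.size)
ppop <- cells[rep(seq_len(nrow(cells)), cells$n.rep),
              c("agecat", "race", "sex")]

# step 4: force ergm.ego() to build the pseudopopulation from this frame
# fit <- ergm.ego(egodat ~ edges,
#                 control = control.ergm.ego(ppopsize = ppop))
```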
martinamorris commented 4 years ago

@EmilyPo it would be good to implement @krivit's suggestion when you start getting serious about the results. For starting purposes, I think you can estimate using ppop=samplesize.

The key thing we want to diagnose first is whether you continue to get those huge discrepancies in the ppopsize requested and produced. @krivit -- i'm referring to https://github.com/EmilyPo/Diss-Duration/issues/1#issuecomment-578279128

EmilyPo commented 4 years ago

If I use the size of the sample as my pseudopop, it keeps that network size (although I do get a warning message that using the same pseudopopulation size as the sample size is not advisable under weighted sampling.)

[screenshot: Screen Shot 2020-01-29 at 4.30.36 PM]

martinamorris commented 4 years ago

didn't you run several tests using different ppop sizes and find that the coefs were very similar?

EmilyPo commented 4 years ago

Summary below. At the time I think I was only paying attention to the age-related terms...I didn't notice that the edges term varies as much as it does.

[screenshot: Screen Shot 2020-01-29 at 5.10.24 PM]

martinamorris commented 4 years ago

i'd be curious to see what the edges-only model would look like across those ppopsizes.

krivit commented 4 years ago

The key thing we want to diagnose first is whether you continue to get those huge discrepancies in the ppopsize requested and produced. @krivit -- i'm referring to #1 (comment)

Hard to say. I'd want code and data to reproduce this. I don't think I've ever tested ergm.ego with such a large sample, and certainly never with sample size larger than pseudopopulation size.

martinamorris commented 4 years ago

Hard to say. I'd want code and data to reproduce this. I don't think I've ever tested ergm.ego with such a large sample, and certainly never with sample size larger than pseudopopulation size.

we used exactly these data for the 2.5 yr SHAMP project. so sample size is not the issue. but we always used an estimation ppopsize of 50K, so i'm wondering if the downsizing is causing this.

code is easy to get you, data have some restrictions. but i think we need a reprex as this behavior is not what we'd want, even if we don't want to encourage downsampling.

EmilyPo commented 4 years ago

this is the edges-only marriage/cohab network estimated on various ppopsizes:

[screenshot: Screen Shot 2020-01-30 at 2.39.16 PM]

martinamorris commented 4 years ago

Did you verify that the net size you asked for was the net size you got?

EmilyPo commented 4 years ago

Yes - when I switched to using "ppop.wt=sample", the network size matched the input ppopsize in all the networks.

EmilyPo commented 4 years ago

For example, when I estimate a network without "ppop.wt=sample" I get this warning:

[screenshot: Screen Shot 2020-01-30 at 2.55.33 PM]

martinamorris commented 4 years ago

Sweet. Ok, then I'd say the variability is pretty large, and the value of the estimated adjusted coef appears to be correlated with size, but I'd like to see the s.e.'s before passing judgement.

Also, would like to see what these estimates look like with a weight vector = rep(1,N)

martinamorris commented 4 years ago

@EmilyPo @cklumb given the questions we had today about the impact of small weights on both the 'sample' and 'round' options for egodata construction, I'm wondering if it is worth implementing @krivit's suggestion above https://github.com/EmilyPo/Diss-Duration/issues/1#issuecomment-579638711. My interpretation of his comment is:

  1. For the ergm, it's not the individual weights that matter, because those weights include the effects of many factors that are not included in the ergm. As far as the model is concerned, all cases with the same values for the predictors included in the model are "exchangeable". In this "cell", defined by the crosstab of all the predictors, all cases have the same impact on the model, regardless of their individual weight.

  2. So the real weight of interest for the ergm is the sum of the within-cell weights. If that is very small, then you probably won't get any observations in that cell of the predictor space, and that could raise problems for estimation. This is much more of a problem for factor levels (if there are no observations in that level) than for continuous predictors, as the latter leverage a parametric linear assumption about the relationship between logit(p(tie)) and the value of the predictor. So if you're missing a couple of ages, you still have the rest of the age range to anchor the estimate.

  3. One remaining concern would arise if, in the simulation, you used other variables as predictors for epi dynamics (e.g., testing, or treatment). Then you'd want a decent distribution of these variables also.
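A quick way to check those within-cell weight sums in base R (attribute names are hypothetical; "egos" is one row per respondent with a sample weight column):

```r
# sum of sampling weights in each cell of the predictor crosstab;
# cells at or near 0 are the ones likely to cause estimation trouble
xtabs(weight ~ race + agecat, data = egos)
```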