Bootstrap procedure? - Githubissues

elbersb / segregation

R package to calculate entropy-based segregation indices, with a focus on the Mutual Information Index (M) and Theil’s Information Index (H)

https://elbersb.com/segregation

Bootstrap procedure? #2

Closed bfjarvis closed 5 years ago

bfjarvis commented 5 years ago

Hi,

Thanks for the package! You've put a lot of work into it!

I'm wondering about the bootstrapping procedure, though. It looks like you are bootstrapping based on samples (with replacement) of group-by-unit observations, but is that the right way to go? Wouldn't it make more sense to take bootstrap samples of individuals within units? Much more computationally intensive, but it seems more theoretically justifiable.

elbersb commented 5 years ago

Glad that you're finding the package useful!

I assume you're referring to these lines? The code takes N samples from the unit-group combinations, using the frequencies as weights, where N is the individual sample size. I could've expanded the data frame to individual cases, but that's less efficient computationally. So basically it's a bootstrap based on the individual observations. Not sure if that answers your question, but I think that should be right.
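As a rough illustration of the procedure described above, here is a minimal base R sketch. It is not the package's actual code (the linked lines are not reproduced in this thread); the toy data frame `counts` and the helper `bootstrap_counts` are hypothetical names made up for the example.

```r
# Hypothetical toy data: one row per unit-group combination with a count.
counts <- data.frame(
  unit  = c("A", "A", "B", "B"),
  group = c("x", "y", "x", "y"),
  freq  = c(30, 10, 5, 55)
)

# Hypothetical helper: one bootstrap replicate of the count table.
bootstrap_counts <- function(d) {
  n <- sum(d$freq)  # individual sample size N
  # Draw N row indices with replacement, weighting each row by its frequency.
  # This mimics resampling individuals without expanding the data frame.
  idx <- sample(nrow(d), size = n, replace = TRUE, prob = d$freq)
  # Collapse the draws back to one count per unit-group combination.
  d$freq <- tabulate(idx, nbins = nrow(d))
  d
}

set.seed(1)
boot <- bootstrap_counts(counts)
```

Each replicate keeps the total N fixed, while the individual cell counts vary from draw to draw.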

bfjarvis commented 5 years ago

I see. I may have just misunderstood the underlying code. I guess when you resample, you return what are essentially individual observations and this bit:

[list(freq = .N), by = vars]

collapses those back down to counts at the group and unit level.

For my part, I've been trying to wrap my head around whether this process makes sense when a particular unit has a count of 0 for a particular group. Bootstrapping (or, equivalently, sampling from a multinomial distribution with group probabilities given by the group composition of the unit) guarantees that the count will be zero again, but that doesn't seem quite right.
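The zero-count concern can be checked with a small sketch. The vector `unit_counts` is a made-up unit with three groups, one of which is empty; it is an assumption for illustration, not data from the package.

```r
# Hypothetical unit with three groups; the third group has no members.
unit_counts <- c(x = 40, y = 60, z = 0)

set.seed(1)
# 1000 multinomial resamples of this unit, one column per resample.
# rmultinom() rescales `prob` internally, so raw counts can be passed.
draws <- rmultinom(n = 1000, size = sum(unit_counts), prob = unit_counts)

# The empty cell has probability 0, so it stays empty in every resample.
zero_stays_zero <- all(draws[3, ] == 0)
```

Because the resampling probabilities come straight from the observed counts, a cell observed as 0 can never become positive unless some model assigns it nonzero probability.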

bfjarvis commented 5 years ago

Just a thought, but it might run faster if you use rmultinom for the bootstrapping, if I'm right that bootstrapping is equivalent to drawing from a multinomial distribution.
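A minimal sketch of that idea, assuming the same kind of count table as in the discussion (the `counts` data frame below is hypothetical): drawing N individuals with replacement and tabulating them has the same distribution as a single multinomial draw of size N with cell probabilities proportional to the observed frequencies.

```r
# Hypothetical toy data: one row per unit-group combination with a count.
counts <- data.frame(
  unit  = c("A", "A", "B", "B"),
  group = c("x", "y", "x", "y"),
  freq  = c(30, 10, 5, 55)
)
n <- sum(counts$freq)  # individual sample size N

set.seed(1)
# One bootstrap replicate: a single multinomial draw over all cells.
boot_freq <- rmultinom(n = 1, size = n, prob = counts$freq)[, 1]

# Many replicates at once: one column per replicate, one row per cell.
reps <- rmultinom(n = 500, size = n, prob = counts$freq)
```

This skips both the per-individual sampling and the collapse step, which is presumably where a speed gain would come from.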

elbersb commented 5 years ago

> I see. I may have just misunderstood the underlying code. I guess when you resample, you return what are essentially individual observations and this [bit] collapses those back down to counts at the group and unit level.

Exactly.

> For my part, I've been trying to wrap my head around whether this process makes sense when a particular unit has a count of 0 for a particular group. Bootstrapping (or, equivalently, sampling from a multinomial distribution with group probabilities given by the group composition of the unit) guarantees that the count will be zero again, but that doesn't seem quite right.

Yeah, I see what you mean. I'll leave this issue open to think more about that. It seems, though, that what you want would only be possible once you impose some model.

And yes, I'll see whether I can make the bootstrapping go a bit faster.

elbersb commented 5 years ago

Sampling from a multinomial was a great idea. I've implemented that (e60fb706d8d90d31) and see a four-fold speed increase.