mxndrwgrdnr opened 5 years ago
I figured out how to perform stratified sampling while generating the entire universe of alternatives for all observations at once, and included the fix in the latest commit here. Just had to reorder the observation ids to repeat in sequence (e.g. [1,2,3,1,2,3,1,2,3]) instead of in order (e.g. [1,1,1,2,2,2,3,3,3]).
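The reordering trick can be sketched like this (a minimal numpy illustration of the idea, not the actual commit):

```python
import numpy as np

obs_ids = np.array([1, 2, 3])
sample_size = 3  # alternatives drawn per observation

# np.repeat gives [1,1,1,2,2,2,3,3,3] -- one contiguous block per observation
blocked = np.repeat(obs_ids, sample_size)

# np.tile gives [1,2,3,1,2,3,1,2,3] -- observations repeat in sequence, so
# the k-th pass over all observations lines up with the k-th stratified
# draw for every observation, letting all choice sets be built at once
interleaved = np.tile(obs_ids, sample_size)
```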
This looks great, thanks Max!
I feel like the most common use case for other folks is going to be using strata to overweight or underweight particular categories of alternatives. So we should make sure we can add support for that later without too much trouble (I'm mostly thinking about the API here, since that's harder to change than the implementation details).
Looks like a nice way to support that would be to add a parameter called `strata_weights` that takes a dict or Series mapping each strata id to a proportional value. And the `sampling_regime` and `strata` parameters would be unchanged. Does this sound right? That seems completely compatible with what you've implemented here, which is perfect.
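One way a `strata_weights` mapping could translate into per-alternative sampling probabilities (a hypothetical pandas sketch -- `strata_weights` is only a proposal at this point, and the `tract` column and toy data are made up):

```python
import pandas as pd

# hypothetical alternatives table with a strata column ('tract')
alts = pd.DataFrame({
    'alt_id': range(6),
    'tract': ['a', 'a', 'a', 'b', 'b', 'c'],
}).set_index('alt_id')

# proposed strata_weights: maps each strata id to a proportional value
strata_weights = {'a': 1.0, 'b': 2.0, 'c': 1.0}

# normalize the weights, then split each stratum's probability share
# evenly among its member alternatives
share = pd.Series(strata_weights) / sum(strata_weights.values())
counts = alts['tract'].value_counts()
p = alts['tract'].map(share / counts)

# stratum 'b' now gets half the total sampling probability
sample = alts.sample(n=3, weights=p, replace=True)
```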
- [ ] bump the version number -- looks like 0.2.2.dev1, and we can go ahead and release it on pypi/conda-forge soon
  - setup.py
  - choicemodels/__init__.py
  - docs/source/index.rst
- [ ] add an entry to CHANGELOG.md
@smmaurer so that is probably the most common use case for folks; however, I'm not sure we necessarily want to give them that ability until we also implement functionality to add a correction term to the probabilities, right? Unless we do that, we'd only be giving them the ability to generate bad estimates, no? What we can do is loosen the restriction on strata being evenly distributed by simply calculating the strata population proportions on-the-fly and then sampling from the strata accordingly, which would obviate the need for an additional `strata_weights` parameter.
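Calculating the strata population proportions on-the-fly is a one-liner in pandas (sketch with made-up data; the strata column would come from the table of alternatives):

```python
import pandas as pd

# strata column from a hypothetical table of 10 alternatives
strata = pd.Series(['a'] * 6 + ['b'] * 3 + ['c'] * 1)

# population share of each stratum, computed on the fly -- these shares
# could be used directly as sampling weights, so unevenly sized strata
# are sampled in proportion without a separate strata_weights parameter
proportions = strata.value_counts(normalize=True)
```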
> Unless we do that, we'd only be giving them the ability to generate bad estimates, no?
AFAICT no. If the same bias was used in estimating the coefficients, then it should work correctly.
I feel like it would be fine to allow stratified sampling of alternatives before we implement the correction for MNL estimation.
Honestly, I'm still not convinced that a correction will have any practical effect in most cases, or even be appropriate.
If you're using strata to generate more realistic choice sets, these should actually be the baseline. It's the random sample from the full universe of conceivable alternatives that would produce biased estimates. (I think this is related to Jacob's point -- we're generally sampling to assert something about the choice sets, not just for our own convenience)
Everything I've seen indicates that, in addition to making conceptual sense, over-sampling the higher-utility alternatives also performs well statistically -- e.g. Lemp & Kockelman 2012
And even when a sampling correction is necessary, it generally doesn't matter once the number of alternatives is more than a few dozen -- e.g. Frejinger 2007, Jarvis 2018
Responding to your points, @maurer:
> In the first iteration of this strategic process, SRS of alternatives is used. In each iteration thereafter, alternative inclusion probabilities are set equal to the MNL choice probabilities derived from the previous iteration’s parameter estimates. The likelihood function in the second and any later iterations is updated to include the probability of choice set formation (using weights on alternatives that are proportional to the prior iteration’s choice probability estimates).
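As a toy illustration of that update rule (a numpy sketch with a fixed stand-in coefficient, not the paper's actual estimator -- the real procedure would re-estimate the parameters each iteration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy universe of 100 alternatives with one attribute
x = rng.normal(size=100)
beta = 0.5  # stand-in for the prior iteration's parameter estimate

# first iteration: simple random sampling -> uniform inclusion probabilities
p_srs = np.full(len(x), 1.0 / len(x))

# later iterations: inclusion probabilities equal to the MNL choice
# probabilities implied by the prior estimates (softmax of utilities)
v = beta * x
p_mnl = np.exp(v) / np.exp(v).sum()

# draw a choice set with those weights; the likelihood would then be
# updated with the probability of choice-set formation
choice_set = rng.choice(len(x), size=10, replace=False, p=p_mnl)
```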
> The results show that models including a sampling correction are remarkably better than the ones that do not
The reason why I'm convinced we need a correction factor is that McFadden's derivation of the MNL basically says you need one UNLESS the positive conditioning property and uniform conditioning property are met, which really only holds true under simple random sampling of alts.
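For reference, the correction from McFadden's derivation (writing $\pi(D \mid j)$ for the probability of drawing the sampled set $D$ given that alternative $j$ was chosen) enters the conditional likelihood as:

$$P(i \mid D) = \frac{\exp\!\big(V_i + \ln \pi(D \mid i)\big)}{\sum_{j \in D} \exp\!\big(V_j + \ln \pi(D \mid j)\big)}$$

When $\pi(D \mid j)$ is the same for every $j \in D$ (the uniform conditioning property, e.g. under simple random sampling of alts), the $\ln \pi$ terms cancel and the uncorrected MNL likelihood is consistent; otherwise the correction term is needed.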
Max and I just chatted about this in person, and I agree that it's important to implement the sampling correction, but I also suspect the effect will often be negligible.
I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?
Ideally we want to know someone's true choice set, I think, which in housing/transport situations will have budget and accessibility constraints. What McFadden shows is that random sampling is unbiased compared to the full universe, which is kind of a separate issue. If we're using weights to indicate which alternatives are more or less likely to be in the choice set, that seems ok to me. We're still sampling randomly from our best guess of the full choice set.
Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section
Max is right! It's on page 4, and might help us implement the sampling correction.
> The results show that models including a sampling correction are remarkably better than the ones that do not
My reading of the figures is that they show the sampling correction becoming much less important after there are at least a few dozen alternatives. But presumably this is pretty situation-dependent, so including the correction does seem safer.
So, as long as this ticket is still open, I can probably get the proportional strata implemented, to accommodate strata that are not evenly distributed but still sampled randomly and proportionally? Then we can open an issue to implement importance sampling?
Implemented a new feature of the `MergedChoiceTable` class for performing uniform stratified sampling of alternatives. For use cases that require heavy sampling of alternatives to make the problem tractable (e.g. location choice models), we want to make sure that individual choice sets are still representative. Two new arguments/attributes to the `MergedChoiceTable` class, `sampling_regime` and `strata`, allow the user to trigger stratified sampling and to specify the column from the table of alternatives that defines the strata groupings. Because we want each observation to have a representative choice set of alternatives, we have to iterate over each observation and generate choice sets one at a time. As a result, this new sampling method is just as slow as the various sampling-without-replacement methods, even though we are sampling with replacement at a macro level. I think there may be a better way of doing this, whereby the choice sets for all observations can be constructed in one go, without iterating over observations, but I'll leave that for the time being.
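The per-observation approach described here might look roughly like this (an illustrative pandas sketch, not the actual implementation; the `tract` column and toy data are made up):

```python
import pandas as pd

# hypothetical alternatives table with a strata column ('tract')
alts = pd.DataFrame({
    'alt_id': range(8),
    'tract': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
}).set_index('alt_id')

obs_ids = [101, 102, 103]
n_per_stratum = 1  # uniform stratified sampling: same count per stratum

# iterate over observations, drawing one stratified sample per
# observation -- this per-observation loop is what makes the method slow
pieces = []
for oid in obs_ids:
    draw = alts.groupby('tract').sample(n=n_per_stratum, replace=True)
    pieces.append(draw.assign(obs_id=oid))

merged = pd.concat(pieces)  # 3 obs x 4 strata x 1 draw = 12 rows
```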