UDST / choicemodels

Python library for discrete choice modeling
https://udst.github.io/choicemodels
BSD 3-Clause "New" or "Revised" License

uniform stratified sampling of alts #66

Open mxndrwgrdnr opened 5 years ago

mxndrwgrdnr commented 5 years ago

Implemented a new feature of the MergedChoiceTable class for performing uniform stratified sampling of alternatives. For use cases that require heavy sampling of alternatives to make the problem tractable (e.g. location choice models), we want to make sure that individual choice sets are still representative. Two new arguments/attributes to the MergedChoiceTable class, sampling_regime and strata, allow the user to trigger stratified sampling and to specify the column in the table of alternatives that defines the strata groupings. Because we want each observation to have a representative choice set of alternatives, we have to iterate over each observation and generate choice sets one at a time. As a result, this new sampling method is just as slow as the various sampling-without-replacement methods, even though we are sampling with replacement at a macro level. I think there may be a better way of doing this, whereby the choice sets for all observations can be constructed in one go, without iterating over observations, but I'll leave that for the time being.
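A minimal sketch of the per-observation approach described above, assuming a pandas DataFrame of alternatives with a stratum column. Function, column, and variable names here are illustrative, not the library's actual API:

```python
import numpy as np
import pandas as pd

def stratified_choice_set(alts, strata_col, n_per_stratum, rng):
    """Sample n_per_stratum alternatives (with replacement) from each stratum."""
    parts = []
    for _, group in alts.groupby(strata_col):
        idx = rng.choice(group.index, size=n_per_stratum, replace=True)
        parts.append(alts.loc[idx])
    return pd.concat(parts)

rng = np.random.default_rng(0)
alts = pd.DataFrame({'stratum': ['a'] * 100 + ['b'] * 100,
                     'x': np.arange(200)})
obs_ids = [1, 2, 3]

# One choice set per observation -- this Python-level loop over observations
# is why the naive implementation is slow
choice_sets = {oid: stratified_choice_set(alts, 'stratum', 5, rng)
               for oid in obs_ids}
```

Each observation gets its own stratified choice set, at the cost of one sampling call per observation.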

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.04%) to 76.111% when pulling 035d9a48094a04028e52d29ad805874a1b9bc4c4 on stratified_sampling into 54c936d15457923c8ebb5a29b129ce4da8e7e0c7 on master.

mxndrwgrdnr commented 5 years ago

I figured out how to perform stratified sampling while generating the entire universe of alternatives for all observations at once, and included the fix in the latest commit here. Just had to reorder the observation ids to repeat in sequence (e.g. [1,2,3,1,2,3,1,2,3]) instead of in order (e.g. [1,1,1,2,2,2,3,3,3]).
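A sketch of that vectorized trick, under the assumption that the pooled draws come out grouped stratum-by-stratum: assigning observation ids by repeating them in sequence (tile) rather than in blocks (repeat) spreads each stratum's draws across all observations, so no per-observation loop is needed. Names are illustrative:

```python
import numpy as np

obs_ids = np.array([1, 2, 3])
n_obs = len(obs_ids)
rng = np.random.default_rng(0)

# One pooled draw per (stratum, observation), grouped stratum-by-stratum
strata = {'a': np.arange(0, 100), 'b': np.arange(100, 200)}
draws = np.concatenate([rng.choice(pool, size=n_obs, replace=True)
                        for pool in strata.values()])

# Blocked ids would hand observation 1 all of stratum 'a' draws;
# tiled ids give every observation one draw from every stratum
blocked = np.repeat(obs_ids, len(strata))   # [1 1 2 2 3 3]
tiled = np.tile(obs_ids, len(strata))       # [1 2 3 1 2 3]
```

With the tiled ordering, `draws[tiled == oid]` contains exactly one draw per stratum for each observation.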

smmaurer commented 5 years ago

This looks great, thanks Max!

I feel like the most common use case for other folks is going to be using strata to overweight or underweight particular categories of alternatives. So we should make sure we can add support for that later without too much trouble (I'm mostly thinking about the API here, since that's harder to change than the implementation details).

Looks like a nice way to support that would be to add a parameter called strata_weights that takes a dict or Series mapping each stratum id to a proportional value. The sampling_regime and strata parameters would be unchanged. Does this sound right? That seems completely compatible with what you've implemented here, which is perfect.
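A sketch of how a strata_weights mapping could drive sampling, assuming the weights are proportions summing to 1 that get converted into per-stratum draw counts. strata_weights is only a proposal in this thread, and the stratum labels and sample size here are made up:

```python
import numpy as np
import pandas as pd

strata_weights = {'sf': 0.7, 'mf': 0.3}  # proposed dict: stratum id -> weight
sample_size = 10

# allocate the overall sample size across strata in proportion to the weights
counts = {k: int(round(v * sample_size)) for k, v in strata_weights.items()}

alts = pd.DataFrame({'stratum': ['sf'] * 80 + ['mf'] * 20})
sampled = pd.concat([
    alts[alts['stratum'] == k].sample(n, replace=True, random_state=1)
    for k, n in counts.items()
])
```

Simple rounding like this can leave the counts summing to slightly more or less than sample_size for some weight combinations, so a real implementation would need a tie-breaking rule.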

To do before merging

mxndrwgrdnr commented 5 years ago

@smmaurer so that probably is the most common use case for folks. However, I'm not sure we necessarily want to give them that ability until we also implement functionality to add a correction term to the probabilities, right? Unless we do that, we'd only be giving them the ability to generate bad estimates, no? What we can do is loosen the restriction on strata being evenly distributed by simply calculating the strata population proportions on the fly and then sampling from the strata accordingly, which would obviate the need for an additional strata_weights parameter.
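Computing the strata population proportions on the fly is a one-liner in pandas; a minimal sketch, with made-up data:

```python
import pandas as pd

alts = pd.DataFrame({'stratum': ['a'] * 60 + ['b'] * 30 + ['c'] * 10})

# share of the alternatives universe falling in each stratum,
# usable directly as per-stratum sampling proportions
proportions = alts['stratum'].value_counts(normalize=True)
```

Sampling from each stratum in these proportions reproduces the population mix without the user supplying explicit weights.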

Eh2406 commented 5 years ago

Unless we do that, we'd only be giving them the ability to generate bad estimates, no?

AFAICT no. If the same bias is used when estimating the coefficients, then it should work correctly.

smmaurer commented 5 years ago

I feel like it would be fine to allow stratified sampling of alternatives before we implement the correction for MNL estimation.

Honestly, I'm still not convinced that a correction will have any practical effect in most cases, or even be appropriate.

  1. If you're using strata to generate more realistic choice sets, these should actually be the baseline. It's the random sample from the full universe of conceivable alternatives that would produce biased estimates. (I think this is related to Jacob's point -- we're generally sampling to assert something about the choice sets, not just for our own convenience)

  2. Everything I've seen indicates that, in addition to making conceptual sense, over-sampling the higher utility alternatives also performs well statistically -- e.g. Lemp & Kockelman 2012

  3. And even when a sampling correction is necessary, it generally doesn't matter once the number of alternatives is more than a few dozen -- e.g. Frejinger 2007, Jarvis 2018

mxndrwgrdnr commented 5 years ago

Responding to your points, @smmaurer:

  1. I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?
  2. Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section:

    In the first iteration of this strategic process, SRS of alternatives is used. In each iteration thereafter, alternative inclusion probabilities are set equal to the MNL choice probabilities derived from the previous iteration’s parameter estimates. The likelihood function in the second and any later iterations is updated to include the probability of choice set formation (using weights on alternatives that are proportional to the prior iteration’s choice probability estimates).

  3. Again, unless I'm mistaken, I believe the Jarvis paper is specifically addressing the use of a correction factor in the case of "simple random choice set sampling", which is not what we're talking about with regards to strategic oversampling/undersampling. I can't say I read the whole Frejinger paper either but this sentence from the abstract stood out to me:

    The results show that models including a sampling correction are remarkably better than the ones that do not

The reason why I'm convinced we need a correction factor is that McFadden's derivation of the MNL basically says you need one UNLESS the positive conditioning property and uniform conditioning property are met, which really only holds true under simple random sampling of alts.
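For reference, McFadden's correction amounts to adding the log of the conditional probability of drawing the sampled set to each alternative's systematic utility. Under a common sampling protocol where each non-chosen alternative enters the set independently with inclusion probability q_j, this reduces to subtracting ln(q_j), which downweights oversampled alternatives; under uniform (simple random) sampling the term is constant and cancels out of the logit, which is why no correction is needed there. A minimal numeric sketch (the utilities and inclusion probabilities are made up, and this is the textbook form, not code from this library):

```python
import numpy as np

V = np.array([1.0, 0.5, -0.2])   # systematic utilities of the sampled alternatives
q = np.array([0.5, 0.3, 0.2])    # inclusion probability of each alternative

# corrected utilities: oversampled alternatives (high q) are penalized
corrected = V - np.log(q)
probs = np.exp(corrected) / np.exp(corrected).sum()
```

With uniform q the correction shifts every utility by the same constant, so the choice probabilities are identical to the uncorrected logit.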

smmaurer commented 5 years ago

Max and I just chatted about this in person, and I agree that it's important to implement the sampling correction, but I also suspect the effect will often be negligible.

I thought that McFadden's positive conditioning theory showed explicitly that random sampling of alts from the full universe produces unbiased estimates, no?

Ideally we want to know someone's true choice set, I think, which in housing/transport situations will have budget and accessibility constraints. What McFadden shows is that random sampling is unbiased compared to the full universe, which is kind of a separate issue. If we're using weights to indicate which alternatives are more or less likely to be in the choice set, that seems OK to me. We're still sampling randomly from our best guess of the full choice set.

Unless I'm mistaken, I believe the authors of this paper do apply a correction/modification to their likelihood function as described in their methodology section

Max is right! It's on page 4, and might help us implement the sampling correction.

The results show that models including a sampling correction are remarkably better than the ones that do not

My reading of the figures is that they show the sampling correction becoming much less important after there are at least a few dozen alternatives. But presumably this is pretty situation-dependent, so including the correction does seem safer.

mxndrwgrdnr commented 5 years ago

So, as long as this ticket is still open, I can probably get proportional strata sampling implemented, to accommodate strata that are not evenly distributed but are still sampled randomly and proportionally? Then we can open an issue to implement importance sampling?