UDST / choicemodels

Python library for discrete choice modeling
https://udst.github.io/choicemodels
BSD 3-Clause "New" or "Revised" License

Sampling weights for MergedChoiceTable #5

Closed · smmaurer closed this 6 years ago

smmaurer commented 7 years ago

1) Feature overview

When sampling from a large set of conceivable alternatives (home location choices, travel destination choices), users may want to assign weights to different types of alternatives rather than using pure random sampling. This can be a way of generating more realistic choice sets, or of improving computational efficiency by ensuring an appropriate degree of variation among sampled alternatives.

MergedChoiceTable() should accept a parameter for sampling weights, use them in generating the merged table, and store the weights somehow in the returned object.
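For concreteness, a rough sketch of what this could look like (the `weights` argument and its alignment to the alternatives index are the proposal here, not current behavior, and the exact signature is an assumption):

```python
# Illustrative sketch only -- the `weights` argument is the feature being proposed
import pandas as pd
from choicemodels.tools import MergedChoiceTable

choosers = pd.DataFrame({'income': [40, 60, 80]},
                        index=pd.Index([1, 2, 3], name='chooser_id'))
alternatives = pd.DataFrame({'size': [10., 20., 30., 40.]},
                            index=pd.Index([101, 102, 103, 104], name='alt_id'))

# One weight per alternative: larger alternatives get sampled more often
weights = alternatives['size'] / alternatives['size'].sum()

mct = MergedChoiceTable(choosers, alternatives, sample_size=2, weights=weights)
print(mct.to_frame().head())  # long-format table of chooser-alternative rows
```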

2) Format for the weights

I think there will be three different use conditions to handle.

3) Common use cases

A common use for this will be weights that are inversely related to the distance between chooser and alternative.
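A minimal sketch of such weights, assuming straight-line distance for simplicity (the function name and decay form are illustrative; a network distance would slot in the same way):

```python
import numpy as np

def inverse_distance_weights(chooser_xy, alt_xy, decay=1.0, floor=1e-6):
    """Sampling weights inversely related to chooser-alternative distance."""
    d = np.hypot(alt_xy[:, 0] - chooser_xy[0], alt_xy[:, 1] - chooser_xy[1])
    w = 1.0 / np.maximum(d, floor) ** decay   # floor avoids division by zero
    return w / w.sum()                        # normalize to probabilities

alt_xy = np.array([[0., 1.], [2., 2.], [5., 0.]])
print(inverse_distance_weights((0., 0.), alt_xy))
```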

Another case will be alternatives that are divided into importance strata conditioned on types of choosers. (An InteractionGenerator() could be a nice way to specify this as well.)

4) Sampling implementation

To implement the sampling, we need to refactor the underlying mnl_interaction_dataset() code. Instead of drawing a single large sample of alternatives to distribute among choosers, we'll iteratively draw a new sample for each chooser, passing appropriate weights to np.random.choice().

Refactoring the code this way will also let us try sampling without replacement -- that is, ensuring that each alternative is sampled no more than once for a given chooser. This is more expensive computationally, and probably not very important for estimation or prediction -- but I think it's the nicest behavior for a procedure like this, especially for small sample sizes.

There's an example of iterative sampling (without weights) in section 3a of the Sampling-correction-02 notebook.
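A minimal sketch of that loop, assuming a per-chooser weight lookup (all names here are illustrative, not existing library code):

```python
import numpy as np
import pandas as pd

def sample_alternatives(chooser_ids, alt_ids, sample_size, get_weights, replace=False):
    """Draw an optionally weighted sample of alternatives for each chooser."""
    rows = []
    for c in chooser_ids:
        p = np.asarray(get_weights(c), dtype=float)  # one weight per alternative
        p = p / p.sum()          # np.random.choice needs probabilities summing to 1
        sampled = np.random.choice(alt_ids, size=sample_size, replace=replace, p=p)
        rows.append(pd.DataFrame({'chooser_id': c, 'alt_id': sampled}))
    return pd.concat(rows, ignore_index=True)
```

With replace=False, each alternative appears at most once per chooser, at the cost of a slower draw.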

5) Estimation corrections

Will we need to implement sampling corrections in the model estimation? Probably not, especially if the sampling is used to generate more realistic choice sets.

But this is something to look into, along with methods for out-of-sample validation. I have some references that I'll add to this thread.
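For reference, the standard result for MNL with sampled alternatives (McFadden 1978) is that consistent estimation requires adding the log of the conditional sampling probability to each alternative's systematic utility, so the choice probability over a sampled set C_n becomes:

```latex
P_n(i \mid C_n) =
  \frac{\exp\left( V_{in} + \ln \pi_n(C_n \mid i) \right)}
       {\sum_{j \in C_n} \exp\left( V_{jn} + \ln \pi_n(C_n \mid j) \right)}
```

where pi_n(C_n | j) is the probability of constructing the sampled set C_n given that j is the chosen alternative. Under uniform random sampling this term is identical across the set and cancels out, which is why plain estimation stays consistent in that case; non-uniform weights are where a correction may be needed.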

smmaurer commented 6 years ago

Updating this issue from a year ago. We've had a number of discussions about this, but haven't implemented the full feature yet.

APPLYING sampling weights

One task is writing code to carry out the sampling when weights are passed to the MergedChoiceTable() constructor (interaction.py#70). This is in my court, I think.

GENERATING sampling weights

In a case where we have NxM distinct weights, they might sometimes be unwieldy to calculate (network distance) and/or to store (off the cuff, 100k choosers x 100k alternatives = 10 billion weights, which is 10 GB at one byte per weight and 80 GB as 64-bit floats). Here are some approaches to try:

a) Precomputing

Precompute weights, store them on disk (or in memory), and pass them to MergedChoiceTable(). This could be a good approach if the weights are slow to compute but we are likely to need all of them and to use them multiple times.
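A sketch of what this could look like, storing the N x M weights under a (chooser, alternative) MultiIndex (the file name and the random placeholder weights are illustrative; parquet needs an engine such as pyarrow):

```python
import numpy as np
import pandas as pd

chooser_ids = [1, 2, 3]
alt_ids = [101, 102, 103, 104]

idx = pd.MultiIndex.from_product([chooser_ids, alt_ids],
                                 names=['chooser_id', 'alt_id'])
weights = pd.Series(np.random.rand(len(idx)), index=idx, name='weight')

weights.to_frame().to_parquet('weights.parquet')        # compute once, reuse
weights = pd.read_parquet('weights.parquet')['weight']  # reload in a later session
```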

b) Calculating on the fly

Pass a generator function to MergedChoiceTable(), and calculate weights on the fly for particular combinations of choosers and alternatives. This could be a good approach if memory is the bottleneck, or if we will only need weights for a small number of combinations (e.g. model estimation using a small sample of choosers).
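A sketch of that interface, where the callable maps a chooser id to one weight per alternative so the full N x M table is never materialized (the signature is a guess at what MergedChoiceTable() could accept):

```python
import numpy as np

def make_weight_callback(chooser_xy, alt_xy):
    """chooser_xy: dict of chooser id -> (x, y); alt_xy: array of alternative coords."""
    def weights_for(chooser_id):
        cx, cy = chooser_xy[chooser_id]
        d = np.hypot(alt_xy[:, 0] - cx, alt_xy[:, 1] - cy)
        w = 1.0 / np.maximum(d, 1e-6)   # inverse-distance weights, computed on demand
        return w / w.sum()
    return weights_for

# Could be plugged into the per-chooser sampling loop sketched in the original post.
```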

c) Stratified weights

In some cases it might be more efficient to use a heuristic to calculate approximate weights, or to store group-based weights rather than individual ones.
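A sketch of the banded idea (band edges and weights are illustrative): storage drops from one weight per chooser-alternative pair to one weight per band, at the cost of some approximation within each band.

```python
import numpy as np
import pandas as pd

band_edges = [0, 5, 15, 50, np.inf]             # distance band boundaries (e.g. km)
band_weights = np.array([1.0, 0.5, 0.2, 0.05])  # one sampling weight per band

# Distances from one chooser to each alternative
distances = pd.Series([2.0, 8.0, 30.0, 120.0])
bands = pd.cut(distances, bins=band_edges, labels=False)  # band index per pair
weights = band_weights[bands.to_numpy()]                  # -> [1.0, 0.5, 0.2, 0.05]
```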

This is what @gboeing is working on in PR #32 (distance bands). Geoff, have you done any evaluation of the speed/memory/disk performance of individual distance calculations vs. the stratified bands?

This will be something to explore in more depth as we begin using sampling weights for model estimation.

Also tagging @waddell @mxndrwgrdnr @Arezoo-bz

gboeing commented 6 years ago

> have you done any evaluation of the speed/memory/disk performance of individual distance calculations vs. the stratified bands?

@smmaurer I've been working on this over the past couple of weeks. I'm getting to the point where I feel a full distance matrix (i.e. individual distance calculations) will not be feasible without much better hardware. In the coarsest-grain graph of the Bay Area, we're looking at a matrix of 4-5 billion elements (on the order of 35 GB as 64-bit floats). I can't even instantiate this in memory at this point, let alone populate it. Regarding speed, no matter how fast an algorithm we use (pandana, etc.) to calculate these pairwise network distances, we'll still have to wrap it in a loop and run it billions of times -- essentially another nonstarter. Hence I've been focusing most of my attention and efforts on calculating stratified bands, which seems tractable.

smmaurer commented 6 years ago

Most of this is implemented in PR #37. Moving discussion to Issues #39, #40.