smmaurer opened 6 years ago
Just a note that there's nicer code now for the "alternative-specific weights, with replacement" case. We wrote it when working on choice simulation. It's available as `choicemodels.tools.monte_carlo_choices()`, and can probably be swapped right into `MergedChoiceTable` for sampling of alternatives.
See PR #43 for details and performance assessment.
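If it helps, usage looks roughly like this. A minimal sketch: as I understand the API, the input is a pandas Series of simulated probabilities with a (chooser id, alternative id) MultiIndex, summing to 1 for each chooser; the ids and index names below are just for illustration.

```python
import pandas as pd
from choicemodels.tools import monte_carlo_choices

# Toy probabilities: 3 choosers ('c1'..'c3'), 2 alternatives ('a1', 'a2').
# Each chooser's probabilities sum to 1.
probs = pd.Series(
    [0.8, 0.2, 0.5, 0.5, 0.1, 0.9],
    index=pd.MultiIndex.from_product([['c1', 'c2', 'c3'], ['a1', 'a2']],
                                     names=['chooser_id', 'alt_id']))

# Returns a Series of chosen alternative ids, one per chooser
choices = monte_carlo_choices(probs)
```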
Noting that the link to the Sampling-benchmarks notebook is broken. It should still be there somewhere, though.
This is a new issue to discuss performance optimization for sampling of alternatives in the `MergedChoiceTable` utility. Earlier discussion is in issues #4, #5, and #11, now closed. Core functionality is implemented in PR #37.
Performance of the underlying sampling algorithms
Sampling items from a list is fastest with uniform weights and "replacement" of items (meaning that an item can appear multiple times in the sample).
In my tests, NumPy sampling implementations are generally faster than core Python (comparing NumPy 1.14 and Python 3.6 on Mac), but the differences are small.
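For reference, all four cases can be expressed with a single NumPy call; a sketch with arbitrary sizes:

```python
import numpy as np

population = np.arange(100_000)
weights = np.random.random(100_000)
weights /= weights.sum()  # np.random.choice expects probabilities summing to 1

# The fast case: uniform weights, with replacement
s1 = np.random.choice(population, size=10, replace=True)

# The other three combinations each carry a performance penalty
s2 = np.random.choice(population, size=10, replace=False)
s3 = np.random.choice(population, size=10, replace=True, p=weights)
s4 = np.random.choice(population, size=10, replace=False, p=weights)
```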
Typical performance penalties
Sampling time scales linearly with the sample size and also with the population size -- except for uniform sampling without replacement, which is essentially constant with respect to the population size.
The penalties above are for samples and populations greater than a few hundred thousand; with smaller datasets the performance is more idiosyncratic.
Performance in the context of MergedChoiceTable
In real-world use, there's also overhead from building and merging large DataFrames.
At the extremes, this can be quite significant. For example, generating a merged table from 1,000 choosers and 1,000 alternatives, without sampling (i.e. 1 million rows) takes about 40x longer than uniform sampling of 10 alternatives per chooser with replacement.
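If I'm remembering the PR #37 interface right, those two cases look roughly like this (toy data; the column and index names are just for illustration):

```python
import numpy as np
import pandas as pd
from choicemodels.tools import MergedChoiceTable

choosers = pd.DataFrame({'income': np.random.random(1000)},
                        index=pd.Index(range(1000), name='chooser_id'))
alts = pd.DataFrame({'price': np.random.random(1000)},
                    index=pd.Index(range(1000), name='alt_id'))

# Full cartesian merge, no sampling: 1,000 x 1,000 = 1 million rows
full = MergedChoiceTable(choosers, alts).to_frame()

# Uniform sampling of 10 alternatives per chooser, with replacement: 10,000 rows
sampled = MergedChoiceTable(choosers, alts, sample_size=10).to_frame()
```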
Here are some `MergedChoiceTable` performance benchmarks for sampling 10 alternatives from a set of 100k, for 100k choosers. At this scale there is only minimal penalty for non-uniform weights, but sampling without replacement is very inefficient, because it requires a separate random draw for each chooser.
Interaction weights, which depend on both the chooser and the alternative, have the same problem. At smaller scales, the performance penalty is about 180x, a little worse than the penalty for sampling without replacement. (For this table I wasn't even able to run the benchmark, because it would require 10 billion interaction weights.)
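To illustrate the structure of the problem (a sketch, not the actual `MergedChoiceTable` code):

```python
import numpy as np

n_choosers, n_alts, k = 1000, 100_000, 10  # the benchmarks above use 100k choosers
alts = np.arange(n_alts)

# Uniform weights, with replacement: one vectorized draw covers every chooser
flat = np.random.choice(alts, size=n_choosers * k, replace=True)

# Without replacement, each chooser needs a separate draw, because duplicates
# are excluded per chooser -- n_choosers separate calls is the bottleneck
per_chooser = [np.random.choice(alts, size=k, replace=False)
               for _ in range(n_choosers)]

# Interaction weights have the same structure: a distinct probability vector
# for each chooser forces a separate weighted draw per chooser
```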
These benchmarks are from a fast i7 iMac; an old i5 MacBook is about 50% slower.
Comparison with the older UrbanSim MNL codebase
Our older codebase, `urbansim.urbanchoice`, only supported the basic case of uniform sampling with replacement. It performs a little bit better than the new codebase for this use case, probably through better optimization of the data merging operations.

Resources
Sampling-benchmarks.ipynb
Next steps
Here are some potential lines of investigation:
Why are the repeated random draws (required for sampling without replacement or for interaction weights) so slow? Can we optimize them better?
If the underlying sampling algorithms are part of the problem, are there alternatives to NumPy that would perform better?
For interaction weights, can we get better performance by using generator functions or a different lookup format? What about optimizing for an additional case with shared weights ("buckets") that doesn't require as many separate draws? (A rough sketch of the bucket idea is below.)
There's also the question of how to best calculate the weights themselves, which is a separate problem.
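For the "buckets" idea, something like this sketch could cut the number of separate draws from one per chooser to one per bucket, at least in the with-replacement case. (Everything here is hypothetical; the bucket names and sizes are made up.)

```python
import numpy as np

n_alts, k = 100_000, 10
alts = np.arange(n_alts)

# Hypothetical buckets: choosers grouped so that everyone in a bucket shares
# one weight vector (e.g. weights varying by chooser type, not by chooser)
bucket_sizes = {'renters': 60_000, 'owners': 40_000}
bucket_weights = {b: np.random.random(n_alts) for b in bucket_sizes}

samples = {}
for bucket, n in bucket_sizes.items():
    p = bucket_weights[bucket] / bucket_weights[bucket].sum()
    # One vectorized with-replacement draw per bucket, then reshape so each
    # chooser in the bucket gets k alternatives
    samples[bucket] = np.random.choice(alts, size=n * k, replace=True,
                                       p=p).reshape(n, k)
```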