Closed smmaurer closed 6 years ago
@mxndrwgrdnr Thanks for taking a look at this. Yeah, these are good points. I guess we could also leave pandas formats as the default but provide an option for passing numpy arrays directly, which we could use in the UrbanSim templates to maximize performance. I'll look into this more as I'm building out the capacity-constrained choice simulation, which I think will have much worse performance.
This PR adds functionality for efficient Monte Carlo simulation of choices for a set of K scenarios, each having different probability distributions (and potentially different alternatives). Choices are independent and unconstrained, meaning that the same alternative can be chosen in multiple scenarios.
This is a component of issue #26. With this PR, we have full support in ChoiceModels for unconstrained choice simulation. The next PR will handle capacity constraints. A separate PR in UrbanSim Templates will provide access to this logic.
Discussion
This PR adds a tool called `choicemodels.tools.monte_carlo_choices()`.
Using this is equivalent to applying `np.random.choice()` to each of K scenarios, but it's implemented as a single-pass matrix calculation. This is about 50x faster than using `df.apply()` or a loop. The algorithm is adapted from `urbansim.urbanchoice`.
For cases where all the choice scenarios have the same probability distribution among alternatives, you don't need this function. You can use `np.random.choice()` with `size=K`, which will be more efficient. (For example, that would work for a choice model whose expression includes only attributes of the alternatives.)
PR includes a unit test that confirms the simulated choices align with the provided probabilities.
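To illustrate the single-pass idea, here is a minimal numpy sketch. It is not the code in this PR: `simulate_choices` is a hypothetical helper, and it assumes the probabilities are already arranged as a K x J matrix whose rows sum to 1. Each row's cumulative probabilities define bins on [0, 1), and one uniform draw per row selects a bin.

```python
import numpy as np

def simulate_choices(probs, rng=None):
    """Draw one choice per row of a (K, J) probability matrix.

    Hypothetical helper for illustration only. Each row must sum to 1.
    Returns the positional index (0..J-1) of the chosen alternative.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=float)
    # Cumulative probabilities along each row: the upper edge of each alternative's bin
    cum = probs.cumsum(axis=1)
    # Guard against floating-point rounding so the last bin always closes at 1
    cum[:, -1] = 1.0
    # One uniform draw in [0, 1) per scenario
    draws = rng.random((probs.shape[0], 1))
    # Count the bins whose upper edge the draw has reached or passed;
    # that count is the index of the bin containing the draw
    return (draws >= cum).sum(axis=1)

rng = np.random.default_rng(0)
probs = np.array([[0.1, 0.9], [0.5, 0.5], [1.0, 0.0]])
choices = simulate_choices(probs, rng)

# Same distribution of outcomes, but one np.random.choice-style call per scenario
looped = np.array([rng.choice(len(p), p=p) for p in probs])
```

The loop at the end draws from the same distributions but makes one call per scenario, which is the pattern the matrix version avoids.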
Usage
This is implemented as a general-purpose function that can accept any list of indexed probabilities -- so it will work with output from our own MNL estimator, or PyLogit, or future model types. It can be called directly or used as the back end for a model template.
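As a rough sketch of what "indexed probabilities" could look like in practice, the example below assumes the input is a pandas Series with a two-level MultiIndex (scenario id, then alternative id), with rows grouped by scenario and the same number of alternatives in each. The function `choices_from_series` and the index names are hypothetical stand-ins, not the signature of `monte_carlo_choices()`.

```python
import numpy as np
import pandas as pd

def choices_from_series(probabilities, rng=None):
    """Hypothetical stand-in: simulate one choice per scenario.

    Assumes a Series indexed by (scenario id, alternative id), rows grouped
    by scenario, and an equal number of alternatives per scenario.
    """
    rng = np.random.default_rng() if rng is None else rng
    obs_ids = probabilities.index.get_level_values(0)
    alt_ids = probabilities.index.get_level_values(1).to_numpy()
    num_obs = obs_ids.nunique()
    num_alts = len(probabilities) // num_obs
    # Reshape the flat Series into one row per scenario
    probs = probabilities.to_numpy(dtype=float).reshape(num_obs, num_alts)
    alts = alt_ids.reshape(num_obs, num_alts)
    # Cumulative bins per row, then one uniform draw per scenario
    cum = probs.cumsum(axis=1)
    cum[:, -1] = 1.0  # guard against rounding
    draws = rng.random((num_obs, 1))
    positions = (draws >= cum).sum(axis=1)
    # Map positions back to alternative ids, which may differ across scenarios
    chosen = alts[np.arange(num_obs), positions]
    return pd.Series(chosen, index=obs_ids.unique())

index = pd.MultiIndex.from_tuples(
    [(101, "a"), (101, "b"), (102, "b"), (102, "c"), (103, "a"), (103, "c")],
    names=["obs_id", "alt_id"],  # illustrative names
)
probabilities = pd.Series([0.0, 1.0, 1.0, 0.0, 0.3, 0.7], index=index)
choices = choices_from_series(probabilities, np.random.default_rng(0))
```

Because the function only needs ids and probabilities, the same input could come from any estimator that produces them.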
Performance
Overall the performance is excellent, especially compared to `df.apply()` as noted above.
Simulating choices is faster than calculating choice probabilities from the MNL utility equations. For 1 million choice scenarios with 10 alternatives each, calculating the probabilities takes 1.0 seconds and then simulating choices takes 0.5 seconds, on an old i5 MacBook.
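For reference, the two steps being compared can be sketched roughly as follows. This is an illustration in plain numpy, not the ChoiceModels code path: the utilities are random placeholders, `mnl_probabilities` is a hypothetical helper, and timings will vary by machine and problem size.

```python
import time
import numpy as np

def mnl_probabilities(utilities):
    """MNL (softmax) probabilities for a (K, J) matrix of utilities."""
    utilities = np.asarray(utilities, dtype=float)
    # Subtracting the row maximum avoids overflow and leaves the result unchanged
    shifted = utilities - utilities.max(axis=1, keepdims=True)
    expu = np.exp(shifted)
    return expu / expu.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, J = 100_000, 10                      # smaller than the 1 million case above
utilities = rng.normal(size=(K, J))     # placeholder utilities

t0 = time.perf_counter()
probs = mnl_probabilities(utilities)
t1 = time.perf_counter()

# Simulate one choice per scenario from the cumulative probabilities
cum = probs.cumsum(axis=1)
cum[:, -1] = 1.0                        # guard against rounding
choices = (rng.random((K, 1)) >= cum).sum(axis=1)
t2 = time.perf_counter()

print(f"probabilities: {t1 - t0:.3f}s, simulation: {t2 - t1:.3f}s")
```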
Although this seems fine in absolute terms, it's worth noting that it's a little bit slower than the 100%-numpy implementation in the original `urbansim.urbanchoice` codebase. It looks like this is caused by overhead from requiring the probabilities to be formatted as an indexed pandas object.
Profiling indicates that 65% of the execution time, and the vast majority of memory usage, comes from a couple of initial pandas operations. The numpy matrix math is very efficient in comparison.
I think for now, the clean data format is worth the performance hit. But I'd like to go through and do more careful profiling of other parts of the codebase in light of this.
Other changes
`MultinomialLogitResults()` constructor and makes the `estimation_engine` parameter optional
`MultinomialLogitResults.probabilities()`
Versioning