[0.2.dev3] Choice simulation without capacity constraints

smmaurer commented 6 years ago

This PR adds functionality for efficient Monte Carlo simulation of choices for a set of K scenarios, each having different probability distributions (and potentially different alternatives). Choices are independent and unconstrained, meaning that the same alternative can be chosen in multiple scenarios.

This is a component of issue #26. With this PR, we have full support in ChoiceModels for unconstrained choice simulation. The next PR will handle capacity constraints. A separate PR in UrbanSim Templates will provide access to this logic.

Discussion

This PR adds a tool called choicemodels.tools.monte_carlo_choices().

Using this is equivalent to applying np.random.choice() to each of K scenarios, but it's implemented as a single-pass matrix calculation. This is about 50x faster than using df.apply() or a loop. The algorithm is adapted from urbansim.urbanchoice.

For cases where all the choice scenarios have the same probability distribution among alternatives, you don't need this function. You can use np.random.choice() with size=K, which will be more efficient. (For example, that would work for a choice model whose expression includes only attributes of the alternatives.)

PR includes a unit test that confirms the simulated choices align with the provided probabilities.

Usage

from choicemodels.tools import monte_carlo_choices

probabilities =  # pd.Series indexed with observation id and alternative id
choices = monte_carlo_choices(probabilities)

This is implemented as a general-purpose function that can accept any list of indexed probabilities -- so it will work with output from our own MNL estimator, or PyLogit, or future model types. It can be called directly or used as the back end for a model template.

Performance

Overall the performance is excellent, especially compared to df.apply() as noted above.

Simulating choices is faster than calculating choice probabilities from the MNL utility equations. For 1 million choice scenarios with 10 alternatives each, calculating the probabilities takes 1.0 seconds and then simulating choices takes 0.5 seconds, on an old i5 MacBook.

Although this seems fine in absolute terms, it's worth noting that it's a little bit slower than the 100%-numpy implementation in the original urbansim.urbanchoice codebase. It looks like this is caused by overhead from requiring the probabilities to be formatted as an indexed pandas object.

Profiling indicates that 65% of the execution time, and the vast majority of memory usage, comes from a couple of initial pandas operations. The numpy matrix math is very efficient in comparison.

I think for now, the clean data format is worth the performance hit. But I'd like to go through and do more careful profiling of other parts of the codebase in light of this.

Other changes

reorders parameters in the MultinomialLogitResults() constructor and makes the estimation_engine parameter optional
improves the efficiency of MultinomialLogitResults.probabilities()

Versioning

updates ChoiceModels version number to 0.2.dev3

coveralls commented 6 years ago

Coverage increased (+0.7%) to 66.615% when pulling 21167d76462422cfe1ea1598d7b85a0d24edd28b on unconstrained-prediction into 9684d3af614ff9711a3bd9679757b7e2986b191c on master.

smmaurer commented 6 years ago

@mxndrwgrdnr Thanks for taking a look at this. Yeah, these are good points. I guess we could also leave pandas formats as the default but provide an option for passing numpy arrays directly, which we could use in the UrbanSim templates to maximize performance. I'll look into this more as i'm building out the capacity-constrained choice simulation, which i think will have much worse performance.

UDST / choicemodels