UDST / choicemodels

Python library for discrete choice modeling
https://udst.github.io/choicemodels
BSD 3-Clause "New" or "Revised" License

Sampling of alternatives: performance optimization #39

Open smmaurer opened 6 years ago

smmaurer commented 6 years ago

This is a new issue to discuss performance optimization for sampling of alternatives in the MergedChoiceTable utility.

Earlier discussion is in issues #4, #5, and #11, now closed. Core functionality is implemented in PR #37.

Performance of the underlying sampling algorithms

Sampling items from a list is fastest with uniform weights and "replacement" of items (meaning that an item can appear multiple times in the sample).

In my tests, NumPy sampling implementations are generally faster than core Python (comparing NumPy 1.14 and Python 3.6 on Mac), but the differences are small.
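The comparison is between calls like these (a minimal illustration of the two implementations, not the notebook's exact benchmark code):

```python
import random
import numpy as np

rng = np.random.default_rng(0)

# NumPy: uniform sampling with replacement, one vectorized call
numpy_sample = rng.choice(1_000_000, size=10, replace=True)

# Core Python: random.choices also samples with replacement
python_sample = random.choices(range(1_000_000), k=10)
```

Both return 10 items drawn uniformly, possibly with repeats; the difference is only in which runtime does the drawing.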

Typical performance penalties

Sampling time scales linearly with the sample size and also with the population size -- except for uniform sampling without replacement, which is essentially constant with respect to the population size.

These scaling penalties hold for samples and populations greater than a few hundred thousand; with smaller datasets the performance is more idiosyncratic.
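A rough way to check the scaling claims for yourself (a sketch using NumPy's Generator API; `time_draw` is a throwaway helper defined here, not part of choicemodels):

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)

def time_draw(pop_size, sample_size, repeats=5):
    """Time one style of draw; swap the lambda to test other variants
    (e.g. replace=False, or a weights array passed as p=...)."""
    return min(timeit.repeat(
        lambda: rng.choice(pop_size, size=sample_size, replace=True),
        number=1, repeat=repeats))

# Vary pop_size and sample_size independently to see which dimension
# each sampling variant actually scales with.
t_small_pop = time_draw(100_000, 10_000)
t_large_pop = time_draw(1_000_000, 10_000)
```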

Performance in the context of MergedChoiceTable

In real-world use, there's also overhead from building and merging large DataFrames.

At the extremes, this can be quite significant. For example, generating a merged table from 1,000 choosers and 1,000 alternatives, without sampling (i.e. 1 million rows) takes about 40x longer than uniform sampling of 10 alternatives per chooser with replacement.
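The row count of the merged table is the product of the two sizes, which is where the overhead comes from. A toy sketch of the merge (hypothetical column names; MergedChoiceTable's real API differs):

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for choosers and alternatives
choosers = pd.DataFrame({'income': [50, 75, 100]}, index=[1, 2, 3])
alternatives = pd.DataFrame({'price': [10, 20]}, index=['a', 'b'])

# A merged table repeats every chooser for each alternative, so the
# row count is n_choosers * n_alternatives -- DataFrame construction
# and joining dominate once that product gets large.
merged = pd.DataFrame({
    'chooser_id': np.repeat(choosers.index, len(alternatives)),
    'alt_id': np.tile(alternatives.index, len(choosers)),
})
merged = merged.join(choosers, on='chooser_id').join(alternatives, on='alt_id')
```

With 1,000 choosers and 1,000 alternatives and no sampling, this produces the 1 million rows mentioned above; sampling 10 alternatives per chooser cuts it to 10,000.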

Here are some MergedChoiceTable performance benchmarks for sampling 10 alternatives from a set of 100k, for 100k choosers:

| Sampling method | Time |
| --- | --- |
| Uniform sampling with replacement | 0.3 seconds |
| Alternative-specific weights, with replacement | 0.4 seconds |
| Uniform sampling without replacement | 4.5 minutes |
| Alternative-specific weights, without replacement | 6.5 minutes |

At this scale there is only minimal penalty for non-uniform weights. But sampling without replacement is very inefficient, because it requires a separate random draw for each chooser.
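The structural difference is easy to see in NumPy (a sketch, not the MergedChoiceTable code itself):

```python
import numpy as np

rng = np.random.default_rng(0)
n_alts, n_choosers, sample_size = 1000, 500, 10

# With replacement: one vectorized call covers every chooser at once.
with_repl = rng.choice(n_alts, size=(n_choosers, sample_size), replace=True)

# Without replacement: the "no repeats" constraint applies within each
# chooser's sample, so each chooser needs its own draw.
without_repl = np.array([
    rng.choice(n_alts, size=sample_size, replace=False)
    for _ in range(n_choosers)
])
```

The Python-level loop over choosers, rather than the individual draws, is the likely source of the minutes-long runtimes in the table above.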

Interaction weights, which depend on both the chooser and the alternative, have the same problem. At smaller scales, the performance penalty is about 180x, a little worse than the penalty for sampling without replacement. (At the scale of the table above, I wasn't able to run the benchmark at all, because it would require 10 billion interaction weights.)

These benchmarks are from a fast i7 iMac; an old i5 MacBook is about 50% slower.

Comparison with the older UrbanSim MNL codebase

Our older codebase, urbansim.urbanchoice, only supported the basic case of uniform sampling with replacement. It performs a little bit better than the new codebase for this use case, probably through better optimization of the data merging operations.

Resources

Sampling-benchmarks.ipynb

Next steps

Here are some potential lines of investigation:

  1. Why are the repeated random draws (required for sampling without replacement or for interaction weights) so slow? Can they be optimized further?

  2. If the underlying sampling algorithms are part of the problem, are there alternatives to NumPy that would perform better?

  3. For interaction weights, can we get better performance by using generator functions or a different lookup format? What about optimizing for an additional case with shared weights ("buckets") that don't require as many separate draws?

There's also the question of how to best calculate the weights themselves, which is a separate problem.
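The shared-weights "buckets" idea in item 3 could look something like this (a hypothetical sketch, not existing choicemodels code): when many choosers share the same weight vector, one weighted draw per bucket replaces one draw per chooser.

```python
import numpy as np

rng = np.random.default_rng(0)
n_alts, sample_size = 1000, 10

# Hypothetical setup: 6 choosers fall into 3 weight "buckets" that
# share the same alternative-specific weights.
bucket_of_chooser = np.array([0, 0, 1, 1, 1, 2])
bucket_weights = rng.random((3, n_alts))
bucket_weights /= bucket_weights.sum(axis=1, keepdims=True)

# One weighted, vectorized draw per bucket instead of one per chooser.
samples = np.empty((len(bucket_of_chooser), sample_size), dtype=int)
for b in np.unique(bucket_of_chooser):
    members = np.flatnonzero(bucket_of_chooser == b)
    samples[members] = rng.choice(
        n_alts, size=(len(members), sample_size), p=bucket_weights[b])
```

This reduces the number of separate draws from the number of choosers to the number of buckets, which is where the savings would come from.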

smmaurer commented 5 years ago

Just a note that there's nicer code now for the "alternative-specific weights, with replacement" case. We wrote it when working on choice simulation.

It's available as choicemodels.tools.monte_carlo_choices(), and can probably be swapped right into MergedChoiceTable for sampling of alternatives.
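One common vectorized technique for this kind of weighted draw (a sketch of the general approach, not necessarily what `monte_carlo_choices()` does internally) is to invert the cumulative distribution with `searchsorted`, so a single batch of uniform draws maps to weighted choices with no Python-level loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Alternative-specific weights for 4 alternatives
weights = np.array([0.1, 0.2, 0.3, 0.4])
cdf = np.cumsum(weights)

# One vectorized lookup converts uniform draws into weighted choices
draws = rng.random(100_000)
choices = np.searchsorted(cdf, draws)

# Empirical frequencies approximate the weights
freqs = np.bincount(choices, minlength=4) / len(choices)
```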

See PR #43 for details and performance assessment.

smmaurer commented 4 years ago

Noting that the link to the Sampling-benchmarks notebook is broken. It should still be there somewhere, though.