UDST / choicemodels

Python library for discrete choice modeling
https://udst.github.io/choicemodels
BSD 3-Clause "New" or "Revised" License

Calculating interactions between chooser and alternative #4

Closed (smmaurer closed this issue 6 years ago)

smmaurer commented 7 years ago

We need a way to generate columns of data that represent interactions between chooser and alternative. This could be for distances between locations, for weights that vary depending on the category of chooser, and so on.

I'm proposing an InteractionGenerator() class for storing such relationships and calculating them on demand. With very large numbers of choosers and alternatives, this saves both computation and memory, because values are only calculated for the chooser-alternative pairs that are actually requested.

InteractionGenerator() would be a template class. We'll provide a couple of implementations, like DistanceGenerator() for calculating distances, and advanced users can write their own.
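A minimal sketch of how such a template class and one implementation could look. This is an illustration only, not the code in the branch: the constructor arguments, the pairwise meaning of get_data(), the returned pd.Series, and the Euclidean formula on raw lat/lng values are all assumptions.

```python
from abc import ABC, abstractmethod

import numpy as np
import pandas as pd


class InteractionGenerator(ABC):
    """Hypothetical template: holds both tables, computes values on demand."""

    def __init__(self, choosers, alternatives):
        self.choosers = choosers
        self.alternatives = alternatives

    @abstractmethod
    def get_data(self, chooser_ids, alternative_ids):
        """Return one value per (chooser_id, alternative_id) pair.

        Assumption: the two id lists have equal length and are matched
        element by element.
        """


class DistanceGenerator(InteractionGenerator):
    """Illustrative implementation: Euclidean distance in coordinate units."""

    def __init__(self, choosers, alternatives, type='straight_line'):
        # 'type' mirrors the keyword used in the proposal, although it
        # shadows the Python builtin of the same name.
        super().__init__(choosers, alternatives)
        if type != 'straight_line':
            raise NotImplementedError("only 'straight_line' is sketched here")
        self.type = type

    def get_data(self, chooser_ids, alternative_ids):
        # Look up coordinates only for the requested pairs.
        c = self.choosers.loc[chooser_ids, ['lat', 'lng']].to_numpy()
        a = self.alternatives.loc[alternative_ids, ['lat', 'lng']].to_numpy()
        dist = np.sqrt(((c - a) ** 2).sum(axis=1))
        index = pd.MultiIndex.from_arrays(
            [chooser_ids, alternative_ids],
            names=['chooser_id', 'alternative_id'])
        return pd.Series(dist, index=index, name='distance')
```

Nothing is computed until get_data() is called, so the full choosers x alternatives matrix never has to exist. A real distance generator would likely use a geodesic formula rather than treating lat/lng as planar coordinates.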

Usage example:

choosers  # pd.DataFrame with index, lat, lng
alternatives  # pd.DataFrame with index, lat, lng

dg = DistanceGenerator(choosers, alternatives, type='straight_line')
print(dg.get_data(chooser_ids=[...], alternative_ids=[...]))

# include the column in a merged & sampled table
merged_table = MergedChoiceTable(choosers, alternatives, sample_size=10, interactions=[dg])

Another common use case will be providing an InteractionGenerator() to specify sampling weights.

There is a rough sketch of these classes in my branch of the code: interaction.py#L21-L85

smmaurer commented 7 years ago

Digging into it a bit more, I think the clearest justification for this implementation is in calculating sampling weights.

For J choosers (maybe millions) and K alternatives (maybe millions), we would need to generate J x K sampling weights in total, but only the K weights for a single chooser would need to be in memory at any given time (for passing to np.random.choice).
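A rough sketch of that loop, under stated assumptions: sample_alternatives() and weight_fn are hypothetical names, and the sketch uses NumPy's Generator.choice (the newer equivalent of np.random.choice) so that the draw can be seeded.

```python
import numpy as np


def sample_alternatives(chooser_ids, alternative_ids, weight_fn,
                        sample_size, seed=None):
    """Draw a weighted sample of alternatives for each chooser.

    weight_fn(chooser_id, alternative_ids) is a hypothetical callback that
    returns K non-negative weights for one chooser.
    """
    rng = np.random.default_rng(seed)
    alternative_ids = np.asarray(alternative_ids)
    samples = {}
    for cid in chooser_ids:
        # Only this chooser's K weights exist in memory at this point.
        w = np.asarray(weight_fn(cid, alternative_ids), dtype=float)
        p = w / w.sum()  # choice() expects probabilities that sum to 1
        samples[cid] = rng.choice(alternative_ids, size=sample_size,
                                  replace=False, p=p)
    return samples
```

Each pass through the loop discards the previous chooser's weights, so peak memory for the weights is proportional to K rather than J x K. The cost is a Python-level loop over choosers, which a real implementation might batch.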

Interaction data columns can be generated after the sampling, which would be easier in most cases than writing a subclass of InteractionGenerator(). For example:

mct = MergedChoiceTable(choosers, alternatives, sample_size=10)

# relative price = alternative's price / chooser's income
df = mct.to_frame()
df['relative_price'] = df.price / df.income

I think that would add the column directly into the object's underlying dataframe, since df is a reference, but we should probably write explicit methods for this.
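To make the post-sampling approach concrete without relying on MergedChoiceTable internals, here is a small pandas-only sketch. The tables, ids, and sampled pairs are invented for illustration, and the long table it builds is only an assumption about the general shape of a merged table.

```python
import pandas as pd

# Toy inputs, indexed by chooser id and alternative id respectively.
choosers = pd.DataFrame({'income': [50.0, 100.0]}, index=[1, 2])
alternatives = pd.DataFrame({'price': [10.0, 20.0, 30.0]},
                            index=[10, 11, 12])

# Pretend these (chooser, alternative) pairs came out of the sampling step.
pairs = pd.DataFrame({'chooser_id': [1, 1, 2, 2],
                      'alternative_id': [10, 12, 11, 12]})

# Attach chooser and alternative attributes to each sampled pair.
df = (pairs.join(choosers, on='chooser_id')
           .join(alternatives, on='alternative_id')
           .set_index(['chooser_id', 'alternative_id']))

# relative price = alternative's price / chooser's income
df['relative_price'] = df.price / df.income
```

The interaction column is computed for sample_size rows per chooser instead of for every possible pair. Whether an assignment like this reaches the table's own data depends on whether to_frame() returns the underlying DataFrame or a copy, which is a further argument for explicit methods.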

smmaurer commented 6 years ago

Most of this is implemented in PR #37. Moving discussion to Issues #39, #40.