smmaurer closed this 6 years ago
Digging into it a bit more, I think the clearest justification for this implementation is in calculating sampling weights.
For J choosers (maybe millions) and K alternatives (maybe millions), we would need to generate J x K sampling weights, but only K of them would need to be in memory at any given time (for passing to np.random.choice).
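A minimal sketch of that per-chooser sampling, assuming the weights can be computed one chooser at a time. The names sample_alternatives, weight_fn and inverse_distance are hypothetical and used only for illustration; they are not part of the library.

```python
import numpy as np

def inverse_distance(chooser_pos, alt_pos):
    # Hypothetical weight function: nearer alternatives get larger weights.
    return 1.0 / (1.0 + np.abs(alt_pos - chooser_pos))

def sample_alternatives(chooser_pos, alt_pos, weight_fn, sample_size, seed=0):
    """Sample alternatives for each chooser, holding only K weights at a time."""
    rng = np.random.RandomState(seed)
    n_alts = len(alt_pos)
    samples = np.empty((len(chooser_pos), sample_size), dtype=int)
    for j, pos in enumerate(chooser_pos):
        w = weight_fn(pos, alt_pos)   # length-K vector for this chooser only
        p = w / w.sum()               # normalize to probabilities
        samples[j] = rng.choice(n_alts, size=sample_size, replace=False, p=p)
    return samples

choosers = np.array([0.0, 5.0, 10.0])   # J = 3 chooser locations
alts = np.arange(20, dtype=float)       # K = 20 alternative locations
samples = sample_alternatives(choosers, alts, inverse_distance, sample_size=4)
```

Each row of samples holds the positional indices of the alternatives drawn for one chooser. The full J x K weight matrix is never materialized, at the cost of a Python-level loop over choosers.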
Interaction data columns can be generated after the sampling, which would be easier in most cases than writing a subclass of InteractionGenerator(). For example:
mct = MergedChoiceTable(choosers, alternatives, sample_size=10)
# relative price = alternative's price / chooser's income
df = mct.to_frame()
df['relative_price'] = df.price / df.income
I think that would add the column directly into the object's underlying dataframe, since df is a reference, but we should probably write explicit methods for this.
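One possible shape for such an explicit method, sketched with plain pandas. ChoiceTableSketch and add_column are hypothetical stand-ins used only to illustrate the idea; they are not the MergedChoiceTable API, and the data is made up.

```python
import pandas as pd

class ChoiceTableSketch:
    """Hypothetical stand-in for a merged choice table."""

    def __init__(self, df):
        self._df = df

    def to_frame(self):
        return self._df

    def add_column(self, name, func):
        # Compute the column from the merged rows and store it on the table,
        # so callers do not rely on to_frame() returning a reference.
        self._df[name] = func(self._df)
        return self

merged = pd.DataFrame({
    'chooser_id': [1, 1, 2, 2],
    'alt_id': [10, 11, 10, 12],
    'income': [50.0, 50.0, 80.0, 80.0],
    'price': [100.0, 200.0, 100.0, 400.0],
})

table = ChoiceTableSketch(merged)
table.add_column('relative_price', lambda d: d.price / d.income)
relative_price = table.to_frame()['relative_price'].tolist()
```

Routing the assignment through a method keeps the behavior the same whether to_frame() hands back the underlying frame or a copy.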
Most of this is implemented in PR #37. Moving discussion to Issues #39, #40.
We need a way to generate columns of data that represent interactions between chooser and alternative. This could be for distances between locations, for weights that vary depending on the category of chooser, and so on.
I'm proposing an InteractionGenerator() class for storing such relationships and calculating them on demand. This approach provides computational and memory efficiencies when there are very large numbers of choosers and alternatives.
InteractionGenerator() would be a template class. We'll provide a couple of implementations, like DistanceGenerator() for calculating distances, and advanced users can write their own.
Usage example:
Another common use case will be providing an InteractionGenerator() to specify sampling weights.
There is a rough sketch of these classes in my branch of the code: interaction.py#L21-L85
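To make the template-class idea concrete, here is a minimal sketch of how a base class and a distance implementation could fit together. This is not the code in the linked branch: the values() method, its signature, and the coordinate dictionaries are assumptions made for illustration.

```python
import numpy as np

class InteractionGenerator:
    """Template: subclasses compute chooser-alternative values on demand."""

    def values(self, chooser_ids, alt_ids):
        # Hypothetical interface: one value per (chooser, alternative) pair.
        raise NotImplementedError

class DistanceGenerator(InteractionGenerator):
    """Euclidean distance between chooser and alternative locations."""

    def __init__(self, chooser_xy, alt_xy):
        # Dicts mapping id -> (x, y); only the coordinates are stored,
        # never the full chooser-by-alternative distance matrix.
        self.chooser_xy = chooser_xy
        self.alt_xy = alt_xy

    def values(self, chooser_ids, alt_ids):
        c = np.array([self.chooser_xy[i] for i in chooser_ids], dtype=float)
        a = np.array([self.alt_xy[i] for i in alt_ids], dtype=float)
        return np.sqrt(((c - a) ** 2).sum(axis=1))

gen = DistanceGenerator(
    chooser_xy={1: (0, 0), 2: (1, 1)},
    alt_xy={10: (3, 4), 11: (1, 1)},
)
# Distances for three sampled (chooser, alternative) pairs.
dist = gen.values([1, 1, 2], [10, 11, 11])
```

Because values() is only called for the pairs that survive sampling, the cost scales with the size of the sampled table rather than with J x K.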