UDST / urbansim_templates

Building blocks for simulation models
https://udst.github.io/urbansim_templates
BSD 3-Clause "New" or "Revised" License

Integrate changes from ej-test branch #75

Closed smmaurer closed 5 years ago

smmaurer commented 5 years ago

This issue is to plan out integrating the changes from the ej-test branch into the main codebase, so that models using that code can more easily take advantage of other bug fixes and updates.

I can do the implementation, with input from @janowicz! (cc @cvanoli)

Here's the code diff. The ej-test branch is based on v0.1.dev15.

1. probas() method added to LargeMultinomialLogitStep

See large_multinomial_logit.py in the diff.

After a model is fit, this provides choice probabilities for a single representative chooser and the full list of alternatives. Looks like it won't work if there are attributes of the chooser in the model expression.

The same thing can be accomplished with built-in functionality now, I think, but to make it easier I'll retain the fitted MNL model as an attribute of the model step.

With a fitted step m, this code will currently give you a copy of the underlying MNL model:

from choicemodels import MultinomialLogitResults

model = MultinomialLogitResults(model_expression = m.model_expression, 
                               fitted_parameters = m.fitted_parameters)

I'll update it so you can get the same thing like this:

model = m.model  # attribute provides MultinomialLogitResults object

Then you can get probabilities like this:

alts = orca.get_table(m.alternatives).to_frame()

# can also use urbansim_templates.utils.get_data() to auto-filter the rows and columns,
# or choicemodels.MergedChoiceTable() to merge choosers and alternatives

probs = model.probs(alts)

This won't quite work at the moment (probs() expects observation data as well as alternatives), but it will be a quick fix.

2. Choice column deleted from the observations table for MNL simulation

See large_multinomial_logit.py#L479 in the diff. This looks like a bug fix?

The MNL data-loading code was refactored in v0.1.dev21, and I don't think it pulls the choice column any more: large_multinomial_logit.py#L497-L501

3. In OLS simulation, predicted values are added as a new column

See regression.py in the diff.

Not sure whether this is to fix a problem with indexing, missing values, or the column not existing yet. @janowicz, let's discuss!

4. In the shared _get_df() method, table is filtered to drop non-relevant columns

See shared.py in the diff.

Same thing is implemented in 0.1.dev21.

5. Special case where buildings are choosers

See shared.py#L200-L202 in the diff. Let's discuss this one too, @janowicz!

Tasks

janowicz commented 5 years ago

@smmaurer: 1.) Great, thanks! The idea you propose of getting probs without having to use choicemodels is useful.

3.) The change here was made because the predicted price column doesn't exist prior to simulation (estimated off of a separate dataset).

5.) This was for estimating a development project location choice model, with buildings as choosers and parcels as alternatives (MRCOG wanted an LCM-based backup to the proforma model). The issue, I think, was that explanatory variables with the same name existed on both the buildings table and the parcels table, causing a column-name conflict in the merged table. I could suggest that they name the variables differently. Any other ideas come to mind?

Sounds good about #2 and #4 being already taken care of in master! I'll create an issue in consulting-planning to test the various project simulations using the latest templates code.

smmaurer commented 5 years ago

Thanks @janowicz! This all makes sense and I've edited the task list above accordingly. Will start a new branch with these updates.

I'm not sure about part 5. It's possible this was fixed in 0.1.dev21 as well. But from the ej-test code, it looks like the duplicated column name is used in the model expression, which is more of a challenge.

What I've been thinking about as a solution for cases like this is to create a derived variable with an unambiguous name (for now in variables.py, soon with a template call).

Example: suppose you have columns households.tract_id and jobs.tract_id and you don't want to change the data schema, but you need to estimate a job choice model where both ids appear in the choice table (e.g. in order to calculate distance as an interaction term). You could create derived variables households.household_tract_id and jobs.employment_tract_id that are just copies of the other columns, but can be used unambiguously in the model expression.

This seems better to me than having a conflict resolution rule where the columns are renamed on the fly (or where one or the other gets dropped), because it's easier to track. Do you think this would be a good strategy?
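To make the aliasing idea concrete, here's a minimal pandas sketch (the table and column names are illustrative, not the actual schema). Note how pandas renames the conflicting raw columns on the fly with _x/_y suffixes, which is exactly the kind of implicit renaming that's hard to track, while the derived alias columns keep stable, unambiguous names:

```python
import pandas as pd

households = pd.DataFrame({'household_id': [1, 2], 'tract_id': [10, 20]})
jobs = pd.DataFrame({'job_id': [100, 101], 'tract_id': [10, 30]})

# Derived variables: plain copies with unambiguous names
households['household_tract_id'] = households['tract_id']
jobs['employment_tract_id'] = jobs['tract_id']

# In a merged choice table, the raw 'tract_id' columns collide and get
# auto-suffixed (tract_id_x, tract_id_y); the alias columns do not
merged = households.merge(jobs, how='cross')

# The aliases can now be used unambiguously in a model expression,
# e.g. a same-tract interaction term
merged['same_tract'] = (merged['household_tract_id']
                        == merged['employment_tract_id']).astype(int)
```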

janowicz commented 5 years ago

@smmaurer Sounds good! And to keep things explicit, I agree completely about using variables with new names (e.g. aliases) in the model expression rather than implementing a name-conflict-resolution rule.

smmaurer commented 5 years ago

@janowicz About part 1 of the initial comment, getting probabilities for a representative chooser:

No problems making the choicemodels.MultinomialLogitResults object a persistent attribute of a fitted model step. That's taken care of.

But I'm not sure what the best approach is for getting probabilities across all the alternatives. This is what I had suggested initially:

alts = orca.get_table(m.alternatives).to_frame()
probs = m.model.probs(alts)  # not allowed yet; requires a MergedChoiceTable

We really do need some kind of info about the choosers, though, because probabilities are relative to the other alternatives in the same choice scenario. mnl_simulate() asks for numalts and requires rows to be consecutive; probs() instead requires id columns for the chooser and alternative.
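To illustrate why the choice set matters (a toy example, not the choicemodels API): MNL probabilities are a softmax over the utilities of the alternatives presented together, so adding or dropping alternatives changes every probability in the scenario:

```python
import numpy as np

def mnl_probs(utilities):
    """Softmax: probability of each alternative within one choice scenario."""
    exp_u = np.exp(utilities - np.max(utilities))  # subtract max for numerical stability
    return exp_u / exp_u.sum()

# Same first two alternatives, but probabilities differ because the
# choice sets differ
probs_full = mnl_probs(np.array([1.0, 2.0, 3.0]))
probs_subset = mnl_probs(np.array([1.0, 2.0]))
```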

So this would be a more realistic code snippet:

alts = orca.get_table(m.alternatives).to_frame()
obs = orca.get_table(m.choosers).to_frame().sample(n=1)
probs = m.model.probs(MergedChoiceTable(obs, alts))

Would it be reasonable to do something like that in your diagnostics code whenever you need these probabilities? I could also make some kind of probs_for_representative_chooser() function to automate it -- but we'd want to accept arguments for custom filters, which chooser to use (if there are characteristics of the chooser in the model expression), and so on, and in the end it might not be much easier for the user than coding it directly.
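If we did automate it, a helper could look something like the sketch below. Everything here is hypothetical: the function name, its arguments, and the prob_fn stand-in (which takes the place of the fitted model's probs() on a merged choice table), with plain DataFrames standing in for orca tables:

```python
import numpy as np
import pandas as pd

def probs_for_representative_chooser(choosers, alternatives, prob_fn,
                                     chooser_filter=None, chooser_id=None,
                                     random_state=None):
    """Hypothetical helper: pick one chooser, pair it with every
    alternative, and return that chooser's probabilities over the
    full choice set.

    prob_fn stands in for the fitted model's probs(); it receives the
    merged table of (one chooser x all alternatives)."""
    obs = choosers
    if chooser_filter is not None:
        obs = obs.query(chooser_filter)  # custom filter, e.g. "income > 50"
    if chooser_id is not None:
        obs = obs.loc[[chooser_id]]      # a specific chooser
    else:
        obs = obs.sample(n=1, random_state=random_state)  # a random one
    # cross-merge the single chooser with every alternative
    merged = obs.merge(alternatives, how='cross')
    return prob_fn(merged)

# Toy usage: two choosers, two alternatives, utility = -price
choosers = pd.DataFrame({'income': [50, 60]}, index=[1, 2])
alts = pd.DataFrame({'price': [1.0, 2.0]}, index=[10, 20])

def prob_fn(merged):
    u = -merged['price'].to_numpy()
    e = np.exp(u - u.max())
    return e / e.sum()

p = probs_for_representative_chooser(choosers, alts, prob_fn, chooser_id=1)
```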

smmaurer commented 5 years ago

The work outlined in this issue is now complete.

With help from @cvanoli, we tested the LCOG model code using the updated master branch of urbansim_templates. After a couple of small fixes in PRs #82 and #85, everything is now working.