SEMCOG / semcog_urbansim


Simulate REPM configs generated by autospec #11

Closed sample-mpo closed 6 years ago

sample-mpo commented 6 years ago

This PR facilitates simulation of autospec-generated real estate price model (REPM) configuration files. This is the code counterpart to #10

Eh2406 commented 6 years ago

We also need code for re-estimation, like HLCM Estimation.ipynb in #9.

Once that is added we can merge and test. If the model run looks good, then we will still need to:

Edit: To be clear, I want @sample-mpo to write the estimation code; then I can do the rest of the list.

janowicz commented 6 years ago

A REPM Estimation.ipynb notebook was added to this PR for re-estimating any of the autospec-generated regression configs. The notebook has the same structure as HLCM Estimation.ipynb so should be straightforward to use.

@Eh2406 @semcogli Please note: a test simulation was run with these REPM configs and without the property value scaling step. The simulation completed without hitting the lack-of-feasibility issue, which indicates higher prices. On inspection of the results, however, while the median predicted prices by geography looked reasonable, some excessively high outlier prices were being predicted, skewing mean prices upward. As mentioned in Thursday's meeting, no filter on high-outlying observations was applied during the autospec estimation process for these configs. The presence of some extremely high values in the price predictions suggests that additional estimation filters should be applied. Given this, my recommendation is to hold off on merging these REPM pull requests and to proceed with calibration this week using the model's existing (non-autospec) price models:

  1. Calibrate supply model to supply growth targets using the existing price model configs
  2. Test run of the calibrated simulation without property value scaling step
  3. Only if needed, run calibrated simulation with property value scaling step
Eh2406 commented 6 years ago

Hi

Thank you so much for the update. As to excessively high prices coming out of the hedonic model: the existing price models top-code the output to avoid inf (code). I think it's better to fix this within the hedonic code. Squashing it in to_frame will hide bugs in all the other models.
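For reference, a minimal sketch of what a top-code in the hedonic simulation step might look like. The function name `top_code_prices` and the NaN handling are assumptions for illustration, not the repo's actual code; the 1000.0 default is the arbitrary cap mentioned elsewhere in this thread:

```python
import numpy as np
import pandas as pd

def top_code_prices(predicted, cap=1000.0):
    """Top-code hedonic price predictions so inf values never propagate.

    `cap` is the ceiling applied to the output; inf/NaN predictions are
    also replaced by the cap (an assumption, not necessarily what the
    existing model does).
    """
    prices = pd.Series(predicted, dtype=float)
    # exp() overflow in a log-price hedonic can produce inf; neutralize it
    prices = prices.replace([np.inf, -np.inf], np.nan).fillna(cap)
    return prices.clip(upper=cap)
```

Doing this inside the hedonic step, rather than in a generic `to_frame` wrapper, keeps the fix local to the model that actually produces the bad values.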

Do you have a plan for when to rerun autospec with filtered input? What can we be doing to help keep things moving?

janowicz commented 6 years ago

Ah yes, good to know about the top-code in the hedonic; it could be applied in the hedonic model step here too.

The inf/NaN replacements in to_frame here actually deal with inf/NaNs in the column inputs to the model, not the output. For some reason, there were a small handful of infs in a couple of the computed explanatory variables; I will need to look at the variable definitions. Agreed that an inf/NaN replacer here is definitely not a long-term solution, but I just wanted a quick fix to get to a full simulation run. Would you mind running orca.get_table("buildings").to_frame() (after build_networks and the neighborhood variables have been run) when you get a chance and seeing if you can find any NaNs or infs?
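A sketch of the kind of check being requested here, written against a plain pandas DataFrame (in the simulation you would pass it the frame from `orca.get_table("buildings").to_frame()`); the name `find_bad_columns` is hypothetical:

```python
import numpy as np
import pandas as pd

def find_bad_columns(df):
    """Report, per numeric column, how many NaN and inf values it contains.

    Returns only the columns that have at least one bad value, which is
    what you want when hunting for a handful of infs in explanatory
    variables.
    """
    numeric = df.select_dtypes(include=[np.number])
    report = pd.DataFrame({
        "n_nan": numeric.isna().sum(),
        "n_inf": np.isinf(numeric).sum(),
    })
    return report[(report.n_nan > 0) | (report.n_inf > 0)]
```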

On our end, today we can re-run the autospec hedonics with additional filters, and add the 1000.0 price value top-code to the hedonic simulation code in this PR. And then hopefully move onto calibration by tomorrow.

Eh2406 commented 6 years ago

Working on tracking down the inf/nans :-)

Eh2406 commented 6 years ago

Looks like it is bad 'parcel_id's; trying to track down why.

Eh2406 commented 6 years ago

So I talked to our data people: the NaNs will be fixed, but not quickly; in the meantime we can ignore the 98 buildings. When fitting the models we should ignore buildings of type 99, as they are out-buildings. And the 1000 number is arbitrary; we may want to use something lower for top-coding the output, and definitely want something lower for filtering when fitting the model. @semcogli is looking up what he used.

semcogli commented 6 years ago

I tried to find the notebook I used in estimation. Unfortunately, I haven't located it yet, but I roughly remember using something like a filter keeping observations within 3 SDs of the mean. The other filters are already mentioned in Jacob's comments.
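A minimal sketch of that 3-SD filter, assuming the estimation observations are in a pandas DataFrame (the function name `filter_within_sds` is hypothetical, not code from the repo or the lost notebook):

```python
import numpy as np
import pandas as pd

def filter_within_sds(df, col, n_sd=3.0):
    """Keep rows whose `col` value is within n_sd standard deviations
    of the column mean; drops high (and low) price outliers before
    fitting the hedonic."""
    mu = df[col].mean()
    sd = df[col].std()
    mask = (df[col] - mu).abs() <= n_sd * sd
    return df[mask]
```

In practice this would be applied to the price column before estimation, alongside the building-type and NaN filters discussed above.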

Eh2406 commented 6 years ago

I see the new commit on #10! What are our next steps? :-)

janowicz commented 6 years ago

Running a quick simulation test on our end (to make sure the updated configs simulate ok), and then hoping to narrow the scope of the NaN-replacement code for input variables so it is less broad. Once this is done in the next few hours, the next step will be a full run using this branch alongside the configs from #10.

In parallel, we are preparing/test-running the calibration script, which can then re-run based on whichever price model configs we settle on.

Thanks for the feedback on filters and NaNs!

janowicz commented 6 years ago

Ready to test with the updated configs in #10 :). Please note that I removed the null/inf replacers, as a 5-year test run showed that invalid values were no longer appearing in the input variables; I think autospec may have selected different explanatory variables this time around that did not contain nulls/infs (this autospec run had additional filters, and a tweaked autospec specification recipe to remove a variable too correlated with price). Related to the top-coding discussion above, I added a floor/cap to the price output so that predicted prices are between 1 and 700, but feel free to change it.
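The floor/cap described here amounts to a single clip; an illustrative sketch (not the PR's actual code), using the 1 and 700 bounds from this comment:

```python
import numpy as np

def clip_prices(predicted, floor=1.0, cap=700.0):
    """Bound predicted prices to [floor, cap] so neither near-zero nor
    runaway predictions leave the hedonic step."""
    return np.clip(np.asarray(predicted, dtype=float), floor, cap)
```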

Eh2406 commented 6 years ago

Merging this and #10! Thanks for the work!