Estimation Mode Refinement

joecastiglione commented 1 year ago

Consolidated Improve Estimation Functionality (https://github.com/ActivitySim/activitysim/issues/737) with this issue
Consolidated Complete Estimation Mode for Trip Models (https://github.com/ActivitySim/activitysim/issues/726) with this issue

Estimation Mode Refinement

The current Estimation Mode features of ActivitySim are very much built in the spirit of similar functionality in DaySim: updated observed (survey) data can be fed into the process, and the combination of ActivitySim and Larch can work together somewhat automatically to generate updated parameter estimates. The tools that have been built allow Larch to construct a model that exactly mirrors the defined model in ActivitySim, re-estimate model parameters, and output ActivitySim coefficient files with new parameter estimates that can be used as a drop-in replacement for the existing coefficient files.

However, this tight integration breaks down when the user wants to update not only the coefficient values but also the functional form of utility equations. This leaves the user with two choices: (a) returning to ActivitySim for every tiny change to the specification files and re-running the entire estimation mode process, or (b) editing the utility equations in Larch while exploring different functional forms, and then reconstructing matching specification files later in ActivitySim once the desired functional form is selected. The former solution is tedious and slow, while the latter is error-prone and requires a fairly expert-level understanding of both ActivitySim and Larch.

The goals of this task would be to integrate Larch and ActivitySim more tightly, in order to (1) allow users to move between these tools using a common utility specification format, (2) speed up the generation of data to support revisions to utility functional forms, especially for large data bundles (e.g. destination and scheduling components), possibly by sampling, (3) extend and enhance the documentation of the estimation process, and (4) improve error handling.
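As a rough illustration of goal (1), the sketch below round-trips a list of utility terms through a spec-style CSV, so that terms edited during estimation could be written back in a form the simulation side can read. It does not call any Larch or ActivitySim API. The column names (`Label`, `Expression`, `coefficient`), the helper names `write_spec` and `read_spec`, and the example terms are all assumptions made for illustration.

```python
import csv
import io


def write_spec(terms, out):
    """Write (label, expression, coefficient_name) tuples as spec-style CSV rows.

    The column layout here is an assumed, simplified spec format.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Label", "Expression", "coefficient"])
    for label, expression, coef_name in terms:
        writer.writerow([label, expression, coef_name])


def read_spec(src):
    """Read the same CSV layout back into a list of tuples."""
    reader = csv.DictReader(src)
    return [(r["Label"], r["Expression"], r["coefficient"]) for r in reader]


# Hypothetical utility terms, as an analyst might revise them during estimation.
terms = [
    ("util_dist", "dist", "coef_dist"),
    ("util_dist_squared", "dist ** 2", "coef_dist_squared"),
]

buf = io.StringIO()
write_spec(terms, buf)
buf.seek(0)
round_trip = read_spec(buf)
```

If both tools read and wrote the same layout, the manual reconstruction step in option (b) above would reduce to a file export.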

Improve Estimation Functionality

ActivitySim currently uses the Larch software to estimate models, which allows estimation results to be used directly by the simulation, dramatically reducing the errors common in translating utility expressions into the ActivitySim specification. In version 1.5, numerous improvements are made to the estimation procedures, including reducing the size of the estimation data bundles, increasing the speed at which models can be estimated from them, improving the reporting and error messaging capabilities of Larch, and improving the usability of the coefficient files created by the ActivitySim procedures. Further, an audit will be done to confirm that the estimation procedures for each ActivitySim component are working as expected.

Complete Estimation Mode for Trip Models

Fully implement estimation mode for all submodels. It seems like we are going to get close in phase 5, but it may not be totally complete.

Additional Description:

Purpose: Finish estimation integration which is targeted to all agencies
Create and clean-up example survey files for trips
Add trip models to infer module
Implement estimators for trip models
Implement a Larch estimation notebook for trip destination and mode choice
Add tests and documentation
Additional tidying up of estimation integration as budget allows

jfdman commented 1 year ago

One of the main problems in estimation mode is that the estimation data bundle (EDB) files written out by ActivitySim are not efficient. For example, instead of constructing a destination choice sample where only sampled alternatives are in the choice set, ActivitySim creates an alternatives table with every destination listed, with missing values for unsampled alternatives. The other issue is that the chooser data written to the EDBs is limited to just the fields used in the existing utility equations. Instead, all potential household, person, and land-use data fields should be written to EDBs by default, which would give the model estimator access to all potential data items in estimation. The data formats used are also inefficient: CSV files are slow to read, and replacing them with a binary format should greatly speed up the estimation process. It also seems pretty straightforward (?) to write a function that would write out the utility equations that the analyst specifies in Larch to a revised model spec file. These seem like relatively simple fixes that would address a lot of the problems we are experiencing.
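To make the alternatives-table point concrete, here is a minimal sketch of the difference between a dense table (every destination listed, missing values for unsampled ones) and a compact table that keeps only sampled alternatives. The data layout, the function name `compact_alternatives`, and the toy values are hypothetical and are not the actual EDB format.

```python
def compact_alternatives(dense_rows):
    """Keep only sampled alternatives from a dense chooser-by-alternative table.

    dense_rows maps chooser_id -> {alt_id: value or None}, where None marks
    an alternative that was not sampled for that chooser.
    Returns a list of (chooser_id, alt_id, value) tuples.
    """
    long_rows = []
    for chooser_id, alts in dense_rows.items():
        for alt_id, value in alts.items():
            if value is not None:
                long_rows.append((chooser_id, alt_id, value))
    return long_rows


# Toy example: 2 choosers, 4 destinations, only a few sampled per chooser.
dense = {
    101: {1: 0.5, 2: None, 3: None, 4: 1.2},
    102: {1: None, 2: 0.9, 3: None, 4: None},
}

compact = compact_alternatives(dense)
dense_cells = sum(len(alts) for alts in dense.values())  # cells stored densely
compact_cells = len(compact)                             # cells stored compactly
```

In this toy case the dense layout stores 8 cells and the compact layout stores 3. With thousands of zones and a few dozen sampled alternatives per chooser, the gap would be far larger.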

dhensle commented 1 year ago

Distilling the above conversations down into a "wish list" of improvements:

Multi-processing functionality in ActivitySim to quickly generate EDBs
Improve formatting of EDBs to increase usability for large models like location choice, scheduling, etc. (the "interaction simulate" family of models)
Enhance Larch and/or its integration with ActivitySim to allow for faster iteration when trying new specifications. This might include adding the ability for Larch to calculate new variables if missing from the current table, adding coefficients automatically if missing from the input coefficients file, adding checkpointing for loading data into Larch, etc.
Ensure all models in the ActivitySim core code have an estimation example that gets EDBs created and estimated in Larch.
Add an example data processing pipeline that would demonstrate the process of taking a "typical" survey and producing EDBs, including the handling of multi-day survey data. This would include the "infer.py" module. (Settling on this data set would also be super useful for CI testing.)
Add functionality to just run one step in estimation mode without having to go through all the previous steps. (Unclear if this is even "possible", but it would be nice to decrease the iteration time between changing the input survey data and getting the downstream trip models to not crash.)
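
One wish-list item above, adding coefficients automatically if missing from the input coefficients file, could look roughly like the sketch below. It assumes coefficient rows with `coefficient_name`, `value`, and `constrain` fields and a default starting value of 0.0. The helper name `add_missing_coefficients` and the example names are hypothetical, and nothing here is an existing Larch or ActivitySim function.

```python
def add_missing_coefficients(coefficients, spec_coef_names, default=0.0):
    """Return a copy of coefficients with a row added for each missing name.

    coefficients: list of dicts with assumed keys
        'coefficient_name', 'value', 'constrain'
    spec_coef_names: coefficient names referenced by the utility spec
    New rows start at `default` and are left unconstrained ('F').
    """
    known = {row["coefficient_name"] for row in coefficients}
    result = list(coefficients)
    for name in spec_coef_names:
        if name not in known:
            result.append(
                {"coefficient_name": name, "value": default, "constrain": "F"}
            )
            known.add(name)  # guard against duplicate names in the spec
    return result


# Hypothetical existing coefficients and a spec that adds one new term.
existing = [{"coefficient_name": "coef_dist", "value": -0.12, "constrain": "F"}]
spec_names = ["coef_dist", "coef_dist_squared", "coef_dist_squared"]

updated = add_missing_coefficients(existing, spec_names)
```

The input list is left unmodified, so the original coefficients file contents stay available for comparison after a new specification is tried.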

ActivitySim / activitysim

Estimation Mode Refinement #731