Separate out processing permanent counts from countmatch.reader

cczhu commented 4 years ago

Based on comments in #25 and discussions with @aharpalaniTO, we'll be revising how we treat permanent counts. To do this, we can no longer identify and process permanent count locations while reading in data year-by-year, since data imputation and outlier detection for permanent count stations requires data from other years.

To resolve this:

[x] Separate out permanent count processing methods from AnnualCount. These will eventually go into
[x] Rename AnnualCount to RawAnnualCount, since it's only used by Reader and gets discarded as soon as multiple years are combined into single multi-index tables.
[x] Rework Reader so that it no longer distinguishes between PTCs and STTCs - all locations will be handled the same way.
[x] Refactor growthfactor.py into permcount.py, which will handle all permanent count processing, including outlier detection. We'll move methods into their own files in a subsequent issue if permanentcount.py gets too long. This will also solve one point in #26 .
[x] Refactor test suite to handle these changes. We might need to wait until #33 is finished before finalizing all tests.
[x] Handle #21 in the process.

cczhu commented 4 years ago

Had a rethink of this task, and now realize why Arman was reluctant to start mucking around with data imputation of permanent count stations, especially if he already had PECOUNT working to fill in missing ~2-3 months of data.

TEPs-I's PRTCS algorithm flags PTC years as having data in all 12 months and at least 3/4 of the year. Derived properties, like the ratio between daily traffic and AADT, are only calculated on those years.
Following TEPs-I's algorithm exactly, then imputing the few ratio matrices with missing values, would speed up CountMatch. This is because it would guarantee that eg. any given day of week within a month would have a ratio to use, and we wouldn't have to do a search within a matrix for the closest day of week or closest year. It would not affect the predictive accuracy of the model, though.
Imputation of derived properties is hard, since:
- We'd have to keep everything as self-consistent as possible - imputing the MADT, AADT, and ratios independently of each other may lead to inconsistencies.
- How should AADT even be imputed? Should we use Scikit-Learn's iterative imputer, TEPs-I's growth model, or something completely different (such as using the average of daily_traffic * D_ijd for all available data in that year, given that's what CountMatch suggests doing, or just using the MADT patterns from the closest year).
- If we impute the raw data, how should we even go about doing it? Isn't this what PECOUNT is for? How about using GPR instead?

cczhu commented 4 years ago

Revised class design:

A DerivedVals base class that stores methods for calculating derived properties with no imputation. Since imputation is likely to be a strategy and not a single step in a data reduction pipeline, more sophisticated derived value calculations could subclass this base class.
A GrowthFactor class that stores methods for calculating the growth factor, and results of the particular fit.
A PermCount class that stores both processed data and results/diagnostics of DerivedVals GrowthFactor instances within it. If it turns out we don't need to store anything from DerivedVals or GrowthFactor other than the results, we can mount them in PermCountProcessor instead of here (in which case PermCount will only be a useful indicator that the data structure is different than a regular count).
- Processed values need some measure of data accuracy, and an indicator of whether it was imputed or not. For daily counts, simply having an Imputed column is enough (we could also use a masked DataFrame, but storing the mask would be more expensive than storing the column, and it would look different than for the derived properties). For derived properties that are averaged over multiple days, store the number of days, and -1 if the value was imputed.
A PermCountProcessor class that cycles through every count, first determining if it meets the requirements for being a PTC, then running the derived value and growth factor calculations. If it turns out we don't need to store anything from DerivedVals or GrowthFactor other than the results, we can mount them here rather than in PermCount.

Revised work plan:

[x] code up the framework for processing permanent counts, including default derived property processing.
[x] code up tests for the new framework
[x] create a basic imputer that only fills gaps in derived values after calculating them, to speed up CountMatch.
[ ] determine how many more locations x years could be made into permanent count location-years if we started relaxing the criteria to be a permanent station. If there are tons of locations, or tons of years in current permanent count locations, that would be considered permanent with a small relaxing of criteria, it would justify further work on imputation.
[ ] create a slightly more advanced imputer that handles a small number of missing months using Bagheri's algorithm (eg. estimating AADT by using conversion factors from the nearest year).

Outstanding question: if DerivedVals and GrowthFactor classes have lots of method-specific flags, how should we control for these in config.py? Nested dicts?

cczhu commented 4 years ago

As I'm refactoring to allow for imputation and other multi-year preprocessing of permanent counts, I've created a design where all data from a count location across multiple years are stored in the same object. The amount of data taken at a location can drastically change from year to year, and the permanent count criteria are defined in terms of a single year's data, so currently each location's data is checked year by year using PermCountProcessor.partition_years. The output of this function is perm_years, a list of years that meet the permanent count criteria. Once a PermCount object is created for the count location, perm_years is stored within it, and is used in a number of subsequent methods for calculating derived values and growth factors.

There are a number of issues with this design:

perm_years can freely be modified by the user, but it's a property governed by the general criteria they set.
partition_years is in PermCountProcessor because that class handles partitioning counts into STTCs and PTCs, but I'm not a fan of a class instance's fundamental properties being calculated by other classes.
AADT can only be calculated for years satisfying the permanent count conditions, but MADT can be calculated for months outside of permanent count years. This is not transparent to the user, and makes further code development susceptible to bugs.
It's unlikely, but I can foresee a situation where perm_years changes due to iterative imputation.

Alternative design:

Make perm_years a (lazy?) property of PermCount, and partition_years a method. perm_years should be calculated in __init__, and raise an error if perm_years has zero length. from_count_object should then initialize an instance using self = cls(...) in the middle.
PermCountProcessor should try to initialize a permanent count instance for every single count location in a try/except block, and create a short term count if the initialization fails.
Just like with some growth factor methods, if a derived values method only calculates a property for a permanent year, its name should include _py.

cczhu commented 4 years ago

Decided to adopt points 1 and 3, but not 2. perm_years is now an instance property, but it is still initialized by PermCountProcessor. This is because practically everything else is initialized by PermCountProcessor (all derived values and growth factors, for example), so it would be strange to use an entirely different pattern to attempt initialization of perm_years, especially since it needs to communicate back to its parent whether it succeeded in making a permanent count. We can always refactor later.

cczhu commented 4 years ago

Resolved by #37

CityofToronto / bdit_traffic_prophet

Separate out processing permanent counts from countmatch.reader #32