Permanent count ratio imputer

cczhu commented 4 years ago

As mentioned in #25, TEPs doesn't impute missing values in permanent counts, which significantly limits the number of permanent counts available as references. There are plenty of PTCs that turn into STTCs in other years. It seems reasonable to employ a multi-stage algorithm that first imputes missing data and only then attempts to associate STTCs with PTCs. While it might be more difficult to impute missing daily counts in certain years, we expect day-to-month and day-to-year conversion factors to follow regular patterns that would allow for easy data imputation (and outlier detection, but that's a separate issue) using scikit-learn's IterativeImputer.
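To make the idea concrete, here is a minimal sketch of filling gaps in a ratio matrix with scikit-learn's IterativeImputer. The data layout (one row per permanent count location-year, one column per month) and the synthetic seasonal pattern are assumptions made for illustration; they are not the format countmatch actually produces.

```python
import numpy as np
# IterativeImputer is still experimental in scikit-learn, so it must be
# enabled explicitly before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: 30 location-years x 12 monthly ratios that share a
# seasonal shape, with a per-row scale factor and a little noise.
rng = np.random.default_rng(0)
seasonal = 1.0 + 0.2 * np.cos(2 * np.pi * (np.arange(12) - 6) / 12)
scale = rng.uniform(0.9, 1.1, size=(30, 1))
ratios = scale * seasonal + rng.normal(0.0, 0.01, size=(30, 12))

# Knock out a couple of entries to mimic months with no usable data.
ratios_missing = ratios.copy()
ratios_missing[3, 1] = np.nan
ratios_missing[10, 7] = np.nan

# Each column with gaps is regressed on the other columns, iterating
# until the estimates settle or max_iter is reached.
imputer = IterativeImputer(max_iter=10, random_state=0)
ratios_filled = imputer.fit_transform(ratios_missing)
```

Observed entries pass through unchanged; only the NaN cells are replaced, which is why regular seasonal patterns across columns should make the estimates reasonable.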

We'll need to handle #32 first to allow countmatch to support data imputation.

I think the original brainstorm suggested this would take 15 days, which is pretty ridiculous. Rigging up a test where we drop some random DoM values and check the MAE of the imputed values can't be that hard. I predict the tasks below should take 2 days:

Tasks:

[x] Create an imputer method in countmatch.permcount.
[x] Create a test notebook to determine the MAE or MSE of the imputer's outputs. Ensure typical errors are within a few percent.
[ ] Revise test suite. Refactor the rest of permcount to match.
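The drop-and-check test described above could look roughly like the sketch below. The names `masked_mae` and `column_mean_impute` are hypothetical, the data is synthetic, and the column-mean imputer is only a stand-in baseline so the harness runs without extra dependencies; any function that maps an array with NaNs to a filled array (for example a wrapper around IterativeImputer's `fit_transform`) can be passed in instead.

```python
import numpy as np


def masked_mae(truth, impute_fn, drop_frac=0.1, seed=0):
    """Hide a random fraction of known values, impute them, return the MAE.

    Assumes drop_frac is small enough that every column keeps at least
    one observed value.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(truth.shape) < drop_frac   # True where values are hidden
    corrupted = np.where(mask, np.nan, truth)
    filled = impute_fn(corrupted)
    # Score only the entries that were deliberately hidden.
    return float(np.mean(np.abs(filled[mask] - truth[mask])))


def column_mean_impute(x):
    """Stand-in baseline: replace each NaN with its column's observed mean."""
    col_means = np.nanmean(x, axis=0)
    return np.where(np.isnan(x), col_means, x)


# Hypothetical complete ratio matrix (rows: location-years, columns: months).
rng = np.random.default_rng(1)
seasonal = 1.0 + 0.2 * np.cos(2 * np.pi * (np.arange(12) - 6) / 12)
truth = rng.uniform(0.9, 1.1, size=(30, 1)) * seasonal

mae = masked_mae(truth, column_mean_impute)
relative_error = mae / truth.mean()   # compare against the "few percent" target
```

Keeping the imputer behind a plain callable means the same harness can compare several imputation strategies on identical masks by reusing the seed.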

cczhu commented 4 years ago

As mentioned in #32 (in this comment), we're currently not altering the criteria for flagging permanent count years, and only using the imputer to plug NaN holes generated by get_ratios. We've spun off a new issue (#38) that looks at how far we can take data imputation.

cczhu commented 4 years ago

Test notebook now in sandbox branch.

cczhu commented 4 years ago

Resolved (sort of) by #39

CityofToronto / bdit_traffic_prophet

Permanent count ratio imputer #33