Closed cczhu closed 4 years ago
As mentioned in #32 (in this comment), we're currently not altering the criteria for flagging permanent count years, and only using the imputer to plug NaN holes generated by get_ratios. We've spun off a new issue (#38) that looks at how far we can take data imputation.
Test notebook now in sandbox branch.
Resolved (sort of) by #39
As mentioned in #25, TEPs doesn't impute missing values from permanent counts, which significantly limits the number of permanent counts available as references. There are plenty of PTCs which turn into STTCs in other years. It seems reasonable to employ a multi-stage algorithm that first imputes missing data before attempting to associate STTCs with PTCs. While it might be more difficult to impute missing daily counts in certain years, we expect day-to-month and day-to-year conversion factors to follow regular patterns that would allow for easy data imputation (and outlier detection, but that's a separate issue) using scikit-learn's IterativeImputer.
We'll need to handle #32 first to allow countmatch to support data imputation.
I think the original brainstorm suggested this would take 15 days, which is pretty ridiculous. Rigging up a test where we drop some random DoM values and check the MAE of the imputed values can't be that hard. I predict the tasks below should take 2 days:
Tasks:
countmatch.permcount
.permcount
to match.
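For reference, the drop-and-check test described above could look roughly like the sketch below. It does not use countmatch at all: the ratio matrix is synthetic (one row per hypothetical location-year, one column per month, sharing a seasonal pattern), and the helper names make_synthetic_ratios, drop_random and masked_mae are made up for illustration. The only real library calls are scikit-learn's IterativeImputer (still experimental, hence the enable_iterative_imputer import) and SimpleImputer, which serves as a column-mean baseline.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer


def make_synthetic_ratios(n_rows=50, n_cols=12, seed=0):
    """Synthetic stand-in for DoM-like ratios (NOT real countmatch output)."""
    rng = np.random.default_rng(seed)
    # Shared seasonal pattern across columns, scaled per row, plus small noise.
    pattern = 1.0 + 0.2 * np.sin(2 * np.pi * np.arange(n_cols) / n_cols)
    scale = rng.uniform(0.8, 1.2, size=(n_rows, 1))
    noise = rng.normal(0.0, 0.01, size=(n_rows, n_cols))
    return scale * pattern + noise


def drop_random(values, frac=0.1, seed=1):
    """Return a copy with roughly `frac` of entries set to NaN, plus the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(values.shape) < frac
    holed = values.copy()
    holed[mask] = np.nan
    return holed, mask


def masked_mae(truth, imputed, mask):
    """Mean absolute error over the entries that were dropped."""
    return float(np.mean(np.abs(truth[mask] - imputed[mask])))


truth = make_synthetic_ratios()
holed, mask = drop_random(truth)

# Impute the holes two ways and score each against the known true values.
iterative = IterativeImputer(max_iter=10, random_state=0).fit_transform(holed)
baseline = SimpleImputer(strategy="mean").fit_transform(holed)

mae_iterative = masked_mae(truth, iterative, mask)
mae_baseline = masked_mae(truth, baseline, mask)
print(f"MAE, IterativeImputer: {mae_iterative:.4f}")
print(f"MAE, column-mean baseline: {mae_baseline:.4f}")
```

Swapping the synthetic matrix for real DoM values would give the actual numbers. Comparing against the column-mean baseline is just one way to judge whether the iterative approach is worth the extra complexity.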