Brainstorm modifications to CountMatch

This is a clearinghouse of issues and ideas to test and improve CountMatch.

First, a breakdown of the PRTCS algorithm as it runs after data ingestion, averaging, and growth rate calculation. My problems with each step are noted inline, right after the step they apply to:

For each traffic daily count (TDC; each row of Ms_abs):

Determine the 5 closest centreline segments with PTCs. For PTCs, determine the 5 closest other PTCs. Retrieve the corresponding PTCs for the same direction. One could argue this improperly associates N/E and S/W roads together.
Determine the AADT growth factor, which is currently an unweighted mean of all PTC growth factors. (This is true for every row of Ms_abs.)
For each PTC:
1. For each year, determine the ratio between each day's Daily Count and the year's AADT.
2. Determine the number of days of the week in each year and month. (TEPs does this in a previous step, but I do it using the Daily Count table in this one).
3. Compare the TDC day of the week and year to all the data from the PTC. If the day of week exists in the PTC data, pick all data from the closest year to the TDC year which has the day of week. Otherwise, just pick the closest year.
4. Take the NaN-exclusive average, over the picked data, of the day-to-AADT ratio (D_ij in Bagheri), the day-of-week-averaged-over-the-month daily traffic, the day-to-month factor (DoM) and the MADT. Also save the AADT of the selected year. Since there are five PTCs for each row of Ms_abs, we now have a new, augmented table with 5 x Ms_abs rows. Eqn. 4 of Bagheri suggests we should be calculating a D_ijd, a day-to-AADT ratio for each year, day of week AND MONTH. TEPs is not doing this.
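To make the D_ij versus D_ijd distinction concrete, here is a minimal sketch, not the TEPs or Traffic Prophet implementation. It assumes a hypothetical input layout (a dict from `datetime.date` to daily count for one PTC), approximates each year's AADT as the plain mean of that year's counts, and assumes the ratio is AADT divided by the daily count, since the ratio is later multiplied by a count to give an AADT. The function name `day_to_aadt_ratios` is made up for illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def day_to_aadt_ratios(daily_counts, by_month=False):
    """Average AADT / daily-count ratio, grouped by year and day of week.

    daily_counts: dict mapping datetime.date -> daily count for one PTC
    (a hypothetical layout, not the Traffic Prophet data model).
    With by_month=True the grouping also includes the month (D_ijd-style).
    """
    # AADT per year, approximated as the plain mean of that year's counts.
    counts_by_year = defaultdict(list)
    for day, count in daily_counts.items():
        counts_by_year[day.year].append(count)
    aadt = {year: mean(counts) for year, counts in counts_by_year.items()}

    # Assumed direction: ratio = AADT / daily count.
    ratios = defaultdict(list)
    for day, count in daily_counts.items():
        if by_month:
            key = (day.year, day.month, day.weekday())
        else:
            key = (day.year, day.weekday())
        ratios[key].append(aadt[day.year] / count)
    return {key: mean(values) for key, values in ratios.items()}


# Two Mondays in the same year, one in January and one in July.
counts = {date(2018, 1, 1): 100, date(2018, 7, 2): 300}
per_dow = day_to_aadt_ratios(counts)
per_dow_month = day_to_aadt_ratios(counts, by_month=True)
```

With only year and day of week as keys, the January and July Mondays collapse into a single averaged ratio; with the month added, each keeps its own ratio, which is the seasonal information the month-specific version preserves.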
Calculate:
1. The preliminary AADT estimate (Eqn. 3 of Bagheri): the average of all the STTC data (regardless of year!) multiplied by the day-to-AADT ratio from 3d) (the averaging sub-step under "For each PTC") and by the growth factor to the (want_year - selected_year) power, where selected_year is the year picked in 3c). Bagheri Eqn. 3 clearly shows that different D_ij are used for different days of the week, months and years, but this is not done in TEPs.
2. The preliminary MADT (Eqn. 2 of Bagheri): the TDC multiplied by the averaged DoM from 3d) and by the growth factor to the (want_year - selected_year) power. Likewise, why doesn't TEPs adhere to Bagheri Eqn. 2, which shows MADT being estimated using a DoM specific to the month, year and day of week?
3. MF_STTC, the ratio of the preliminary MADT (4b) to the preliminary AADT (4a).
4. MF_PTC, the ratio of the averaged PTC MADT to the PTC AADT, both from 3d).
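The four quantities above can be sketched as follows. This follows the steps as described in this issue rather than Bagheri's equations or the TEPs code, and all function and parameter names are hypothetical.

```python
def preliminary_estimates(sttc_mean_count, tdc, d_ij, dom, growth,
                          want_year, selected_year):
    """Return (preliminary AADT, preliminary MADT, MF_STTC) for one row.

    sttc_mean_count: mean of all STTC daily counts, regardless of year.
    tdc: the single daily count for this row.
    d_ij, dom: averaged day-to-AADT ratio and day-to-month factor from the PTC.
    growth: annual multiplicative growth factor (e.g. 1.02).
    """
    scale = growth ** (want_year - selected_year)
    aadt_prelim = sttc_mean_count * d_ij * scale
    madt_prelim = tdc * dom * scale
    mf_sttc = madt_prelim / aadt_prelim
    return aadt_prelim, madt_prelim, mf_sttc


def monthly_factor_ptc(ptc_madt, ptc_aadt):
    """MF_PTC: averaged PTC MADT divided by the PTC AADT."""
    return ptc_madt / ptc_aadt


aadt_p, madt_p, mf_sttc = preliminary_estimates(
    sttc_mean_count=1000, tdc=1100, d_ij=1.2, dom=0.9,
    growth=1.02, want_year=2018, selected_year=2016)
mf_ptc = monthly_factor_ptc(ptc_madt=900, ptc_aadt=1000)
```

Note that the growth scaling cancels in MF_STTC, so under this formulation the growth factor has no influence on which PTC gets matched.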
For each PTC and STTC pair (regardless of year), calculate the MSE:
1. Determine the daily-count-wise square deviation i.e. calculate (MF_STTC - MF_PTC)^2 for every row of Ms_abs.
2. For each PTC and STTC pair, take the mean of these daily-count-wise square deviations.
3. Determine the PTC-STTC pair that produces the minimum MSE. Note to self: it's okay that this is a multi-year comparison. Assigning a new PTC to an STTC for each year would likely increase noise, since we have very little STTC data each year.
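A minimal sketch of the minimum-MSE matching for one STTC, assuming the per-row monthly factors have already been computed. The input layout (a dict from a PTC identifier to a list of `(MF_STTC, MF_PTC)` tuples, one per Ms_abs row) and the PTC labels are hypothetical.

```python
from statistics import mean


def best_ptc_match(mf_pairs_by_ptc):
    """Pick the PTC whose monthly factors deviate least from the STTC's.

    mf_pairs_by_ptc: dict mapping ptc_id -> list of (mf_sttc, mf_ptc)
    tuples, one per daily count, pooled over all years.
    Returns (best_ptc_id, dict of MSE per PTC).
    """
    mse = {
        ptc_id: mean((mf_sttc - mf_ptc) ** 2 for mf_sttc, mf_ptc in pairs)
        for ptc_id, pairs in mf_pairs_by_ptc.items()
        if pairs  # skip PTCs with no overlapping rows
    }
    best = min(mse, key=mse.get)
    return best, mse


pairs = {
    "ptc_A": [(1.0, 1.1), (0.9, 0.8)],
    "ptc_B": [(1.0, 1.5), (0.9, 0.9)],
}
best, mse = best_ptc_match(pairs)
```

Because the rows are pooled over all years before averaging, each STTC gets a single PTC assignment, which is the multi-year comparison described in the note above.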
For each PTC and STTC pair (regardless of year), calculate the mean D_ij. This will be used to estimate the final AADT. Bagheri Eqn. 7 implies a similar day-wise summation to Eqn. 3, which is also not what TEPs does.
Make an AADT prediction for STTCs for a user-specified year want_year:
1. Find the closest year of counts to the want_year.
2. Find the mean daily traffic and mean growth factor (all these means are row-by-row, and many are means of the same number since there's a lot of repetition in DoM_STTC) for all Ms_abs rows also in the closest year.
3. Calculate the AADT using Eqn. 7 of Bagheri, augmented with a growth rate to the (want_year - closest_year) power if needed.
Make an AADT prediction for PTCs for a user-specified want_year:
1. Find the closest_year with available data to want_year. In some cases this leaves out a LOT of data from years with more plentiful counts.
2. Find the mean daily traffic (averaged over rows, with no heed taken to weight by monthly representation). Why don't we just use the TRUE AADT? We have that!
3. Determine the AADT by multiplying the mean daily traffic with the growth rate to the (want_year - closest_year) power if needed.
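The PTC prediction steps can be sketched as below, under the assumption that the inputs are a dict of mean daily traffic by year and a single annual growth factor. The function names are hypothetical, and the tie-breaking rule (prefer the earlier year when two years are equally close) is an arbitrary choice for illustration, not something taken from TEPs.

```python
def closest_year(available_years, want_year):
    """Year with data closest to want_year; ties go to the earlier year."""
    return min(available_years, key=lambda y: (abs(y - want_year), y))


def predict_ptc_aadt(mean_daily_by_year, growth, want_year):
    """Scale the closest year's mean daily traffic to want_year.

    mean_daily_by_year: dict mapping year -> mean daily traffic.
    growth: annual multiplicative growth factor (e.g. 1.02).
    """
    year = closest_year(mean_daily_by_year, want_year)
    return mean_daily_by_year[year] * growth ** (want_year - year)


mean_daily = {2015: 1000.0, 2017: 1200.0}
pred_2018 = predict_ptc_aadt(mean_daily, growth=1.02, want_year=2018)
pred_2016 = predict_ptc_aadt(mean_daily, growth=1.02, want_year=2016)
pred_2017 = predict_ptc_aadt(mean_daily, growth=1.02, want_year=2017)
```

Passing the true AADT per year instead of the unweighted mean daily traffic would leave this function unchanged, which is one way to try the "just use the AADTs" idea.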

The PRTCS validation algorithm:

For each PTC-PTC pair (regardless of year), calculate the MSE:
1. Determine the daily-count-wise square deviation i.e. calculate (MF_thisPTC - MF_neighbourPTC)^2 for every row of Ms_abs.
2. For each PTC-PTC pair, take the mean of these daily-count-wise square deviations.
3. Determine the PTC-PTC pair that produces the minimum MSE.
For each PTC-PTC pair (regardless of year), calculate the mean D_ij and pairwise average AADT. Why are we averaging a value that can be quite different year-on-year??
Make an AADT prediction for PTCs for a user-specified year want_year:
1. Find the closest year of counts to the want_year.
2. Find the mean daily traffic and mean growth factor for all Ms_abs rows also in the closest year.
3. Calculate the AADT using Eqn. 7 of Bagheri, but augmented with a growth rate to the (want_year - closest_year) power if needed.
Retrieve the average PTC-PTC AADT (the pairwise average from step 2 of this validation algorithm). Subtract this average from the predicted AADT to get the error. This is an invalid comparison - we're comparing an AADT prediction for a specific want_year to an AADT averaged between the pair across multiple years!

Issues, in order of importance:

Not covered above, but the growth factor estimation method currently used by TEPs isn't self-consistent (it also does something formally illegal, but this has already been corrected in Traffic Prophet):
- TEPs uses a linear fit to determine the (fractional) weekly growth rate over a single year, then scales it up by 52 to obtain the annual (fractional) growth rate. This assumes that seasonal variations with no year-on-year trend lead to a linear fit with a slope of zero, which I find dubious. Can we consider a better way of estimating AADT growth from a single year's observations? Time: 2 days
- TEPs uses an exponential fitting mechanism to determine a multi-year average annual (fractional) growth rate. Bagheri et al. 2013 suggests using the specific year-on-year growths instead (so that if there's a decline between 2012 - 2013 we don't average 2006 - 2018 and wipe that decline out), and we should experiment with implementing this. Time: 3 days
- TEPs spatially averages over all PTCs to get a global growth multiplicative factor of ~1.02 year on year. We should try using the nearest PTC instead. Time: 1 day
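To illustrate the difference between a multi-year average growth rate and year-specific growth rates, here is a minimal sketch. It assumes the exponential fit is an ordinary least-squares fit of log(AADT) against year, which may not be exactly what TEPs does, and the year-on-year function is a simple ratio between consecutive available years rather than Bagheri's formulation. All names are hypothetical.

```python
import math


def exp_fit_growth(aadt_by_year):
    """Multi-year average growth factor from a least-squares fit of
    log(AADT) against year (assumed form of the exponential fit)."""
    years = sorted(aadt_by_year)
    logs = [math.log(aadt_by_year[y]) for y in years]
    y_bar = sum(years) / len(years)
    l_bar = sum(logs) / len(logs)
    slope = (sum((y - y_bar) * (l - l_bar) for y, l in zip(years, logs))
             / sum((y - y_bar) ** 2 for y in years))
    return math.exp(slope)


def year_on_year_growth(aadt_by_year):
    """Growth factor between each pair of consecutive available years,
    annualized when there is a gap of more than one year."""
    years = sorted(aadt_by_year)
    return {
        (a, b): (aadt_by_year[b] / aadt_by_year[a]) ** (1 / (b - a))
        for a, b in zip(years, years[1:])
    }


aadt = {2011: 1000.0, 2012: 1100.0, 2013: 990.0, 2014: 1089.0}
average_growth = exp_fit_growth(aadt)
specific_growth = year_on_year_growth(aadt)
```

In this made-up series the fitted average growth factor is above 1 even though traffic fell by 10% between 2012 and 2013, which is the kind of decline the averaged rate wipes out.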
A number of PTCs have known data errors (e.g. 20044187 between 2017-05-20 and 2017-05-22). These have not been cleaned and lead to extreme outlier growth factors. Time: 1 day
As discussed in steps 3 and 4 of the breakdown ("For each PTC" and "Calculate"), TEPs performs averaging quite differently from Bagheri (possibly deliberately, as a means to reduce noise?). Instead of trying to figure out the significance of each deviation, let's just properly reproduce Bagheri's algorithm and see if it leads to smaller errors. Time: 3 days
TEPs doesn't impute missing values from permanent counts, which significantly limits the number of permanent counts available as references. Can we generate more using scikit-learn's IterativeImputer as an alternative to using PECOUNT or Gaussian processes? Time: 15 days
- There are plenty of PTCs which turn into STTCs in other years. It seems reasonable to employ a multi-stage algorithm that first imputes missing months of the year before attempting to associate STTCs with PTCs.
For AADT predictions with PTCs, just use the AADTs. Time: 1 day
Maybe use the 10 closest road segments going in any direction rather than only using the same +1 or -1 direction? Time: 1 day.
Double-check if Arman's list of invalid PTCs should be relaxed - maybe these stations now correlate better with their neighbours? Time: 1 day.

Since the validation algorithm doesn't even consider comparable AADTs, we should instead reproduce Bagheri's testing method:

Determine the distribution for the number of data points per year, and number of years available, for each STTC.
For several hundred iterations:
1. Create a set of test STTCs by randomly selecting data from x days out of the year, for y years from each PTC. The parameters x and y should be set using the distributions found in the first step (the per-STTC distributions above). Since each test STTC is only allowed to connect to neighbouring PTCs, we can have the same locations in both STTCs and PTCs.
2. Attempt to determine the AADTs of each test STTC using the PRTCS algorithm, for each want_year for which we have AADT data for the test STTC. This will possibly bias our results for years where we have very few STTCs. For each AADT prediction, we can calculate an absolute percentage error (APE), which gives us a distribution of APEs for each year.
3. Determine the median absolute percentage error (MdAPE) for each year. We can also plot predicted vs. ground truth AADT as Arman does. Time: 5 days
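The testing loop above could be sketched roughly as follows, for a single PTC and a single year. This is only a skeleton under stated assumptions: the `estimator` argument stands in for the PRTCS algorithm (here replaced by a naive mean in the usage example), the input is a hypothetical dict from day index to daily count, and the number of sampled days is fixed rather than drawn from the observed STTC distributions.

```python
import random
from statistics import mean, median


def sample_test_sttc(ptc_daily_counts, n_days, rng):
    """Build a fake STTC by keeping n_days randomly chosen days of a PTC."""
    days = rng.sample(sorted(ptc_daily_counts), n_days)
    return {day: ptc_daily_counts[day] for day in days}


def mdape(predicted, truth):
    """Median absolute percentage error over paired predictions."""
    return median(abs(p - t) / t * 100 for p, t in zip(predicted, truth))


def validate(ptc_daily_counts, true_aadt, estimator, n_days,
             n_iter=200, seed=0):
    """Repeatedly subsample a PTC, predict its AADT, and return the MdAPE."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_iter):
        test_sttc = sample_test_sttc(ptc_daily_counts, n_days, rng)
        predictions.append(estimator(test_sttc))
    return mdape(predictions, [true_aadt] * n_iter)


# Usage with a constant-traffic PTC and a naive stand-in estimator.
flat_ptc = {day: 100.0 for day in range(365)}
naive = lambda sttc: mean(sttc.values())
score = validate(flat_ptc, true_aadt=100.0, estimator=naive, n_days=3)
```

Fixing the seed makes a run repeatable, which helps when comparing two variants of the algorithm on the same set of test STTCs.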

I'm also curious whether there's a characteristic distance between PTCs past which they start decorrelating with one another. This can be pretty easily checked by creating a correlation matrix, extracting pairwise correlations, and plotting those against distance. We could then set a minimum number of neighbours and a maximum distance for each STTC, and pick whichever returns more neighbouring PTCs. Time: 1 day
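A minimal sketch of the correlation-versus-distance check, assuming each PTC has a daily-count series aligned on the same days with no gaps, and coordinates in a projected system so that Euclidean distance is meaningful. Both input layouts and the PTC labels are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_vs_distance(series_by_ptc, xy_by_ptc):
    """Return (distance, correlation) for every PTC pair, sorted by distance."""
    points = []
    for a, b in combinations(sorted(series_by_ptc), 2):
        (xa, ya), (xb, yb) = xy_by_ptc[a], xy_by_ptc[b]
        distance = math.hypot(xa - xb, ya - yb)
        points.append((distance, pearson(series_by_ptc[a], series_by_ptc[b])))
    return sorted(points)


series = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
coords = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 10.0)}
points = correlation_vs_distance(series, coords)
```

Plotting the returned pairs as a scatter of correlation against distance would show whether there is a distance beyond which the correlation drops off.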

Also, which is more important - getting a close-by PTC from a different year, or getting a further-away PTC from the same year? Time: 1 day

CityofToronto / bdit_traffic_prophet
