CityofToronto / bdit_traffic_prophet

Suite of algorithms for predicting average daily traffic on Toronto streets
GNU General Public License v3.0

Minimum mean-square error "fitter" for count association #14

Closed · cczhu closed this 4 years ago

cczhu commented 5 years ago

The final step of PRTCS is to determine, for each short-term count location, the permanent count location with the closest DoMADT pattern. The steps to Pythonize this in CountMatch are:

For now, we will only associate each short-term count location with a single permanent count location. Some means of creating a weighted average might be worth investigating, though it might also produce degenerate results, with multiple weighting solutions producing roughly the same (and possibly bad) minimum MSE. We're also allowing traffic from either direction to be matched.
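As a rough sketch, minimum-MSE association in Python could look something like the following (hypothetical function and container names, not the actual CountMatch interface):

```python
import numpy as np

def match_sttc_to_ptc(sttc_ratios, ptc_ratios):
    """Associate an STTC with the PTC whose monthly MADT/AADT
    pattern minimizes the mean-square error.

    sttc_ratios : dict mapping month -> estimated MADT/AADT at the STTC.
    ptc_ratios : dict mapping PTC ID -> (dict of month -> MADT/AADT).
    """
    best_id, best_mse = None, np.inf
    for ptc_id, ratios in ptc_ratios.items():
        # Compare only months where both stations have estimates.
        months = sorted(set(sttc_ratios) & set(ratios))
        if not months:
            continue
        err = np.array([sttc_ratios[m] - ratios[m] for m in months])
        mse = np.mean(err ** 2)
        if mse < best_mse:
            best_id, best_mse = ptc_id, mse
    return best_id, best_mse
```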

cczhu commented 4 years ago

A major concern of mine: the median distance between an STTC and its nearest PTC is 2.27 km, and the mean is 2.52 km. The median distance to the second-nearest PTC is 2.82 km, and the mean is 3.24 km. If correlation between roadways drops with distance, this could greatly reduce the predictive accuracy of CountMatch. (Toronto is roughly 20 x 40 km in size.)
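For reference, distance statistics like these can be reproduced with a k-d tree; a minimal sketch with stand-in coordinates (real code would use projected station locations in metres):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# Stand-in coordinates in a metric projection (metres).
sttc_xy = rng.uniform(0., 20000., size=(500, 2))
ptc_xy = rng.uniform(0., 20000., size=(20, 2))

tree = cKDTree(ptc_xy)
# dists[:, 0] is the distance to the nearest PTC; dists[:, 1] to the
# second nearest.
dists, _ = tree.query(sttc_xy, k=2)
print(np.median(dists[:, 0]) / 1e3, np.mean(dists[:, 0]) / 1e3)  # km
```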

Not to mention most of our PTCs are on highways.

The solution is to find new sources of data, to be discussed further in #19. This post just stresses the importance of that project.

Distribution of distance between STTC and nearest PTC:

[image]

cczhu commented 4 years ago

Created a lit_review branch to include raw notes of papers. Recorded some notes on the minimum MSE method for assigning PTCs to STTCs here.

(Didn't do this in the Wiki or Issues because of a lack of MathTeX support.)

cczhu commented 4 years ago

In DoMSTTC.m, Arman averages growth rates over all of Toronto (to get a year-on-year multiplicative factor of ~1.02). This discards a lot of spatial resolution. Created an issue (#25) to continue recording my concerns.
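For concreteness, here's a toy comparison of the two approaches (hypothetical table layout, and a simplified two-point growth estimate rather than a proper fit):

```python
import pandas as pd

# Hypothetical layout: one row per (station, year) of PTC AADT.
aadt = pd.DataFrame({
    'station': [1, 1, 2, 2],
    'year': [2016, 2017, 2016, 2017],
    'aadt': [10000., 10400., 5000., 5050.],
})

# Per-station year-on-year growth factor.
aadt = aadt.sort_values(['station', 'year'])
growth = aadt.groupby('station')['aadt'].apply(
    lambda x: (x.iloc[-1] / x.iloc[0]) ** (1. / (len(x) - 1)))

# Citywide average growth factor, TEPs-style: a single number (~1.02)
# applied to every station, discarding all spatial variation.
citywide_growth = growth.mean()
```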

cczhu commented 4 years ago

Another issue detailed in the known issues:

Line 95: this improperly sums all short-term counts for the location and year, regardless of whether they fall on the correct day of week. Meanwhile, base_year is the year we're interested in calculating annual patterns for, and sel_year is the closest year for which we have PTC data (ttc_year is the year for which we have STTC data). Preliminary AADT is calculated using GR_STTC^(base_year - sel_year), but it would be more reasonable to use GR_STTC^(base_year - ttc_year), since it's the absolute counts from 2006 that need to be scaled by the multi-year growth rate, not the day-to-year pattern. Not sure if this is truly a bug or a deliberate choice I disagree with.
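In code terms, the discrepancy looks like this (hypothetical numbers):

```python
# Hypothetical values: a 2006 STTC count, ~2%/yr growth, nearest PTC
# data from 2016, predicting for base year 2018.
count, gr_sttc = 10000., 1.02
base_year, sel_year, ttc_year = 2018, 2016, 2006

# DoMSTTC.m: scale by the years elapsed since the nearest PTC year.
aadt_domsttc = count * gr_sttc ** (base_year - sel_year)   # ~10404
# Proposed: scale by the years elapsed since the STTC count year,
# since the raw 2006 counts are what need multi-year scaling.
aadt_proposed = count * gr_sttc ** (base_year - ttc_year)  # ~12682
```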

cczhu commented 4 years ago

MVP on fitter is working. Here's an output plot of predicted AADT at all STTCs in 2018:

[image]

An interactive map can be found in MatcherDev.ipynb, though it sadly doesn't work online.

Outstanding issues:

cczhu commented 4 years ago

Attempted to measure the error in CountMatch AADT predictions against TEPs's ground truth. Found >20% fractional errors, which led to an investigation detailed in CountMatchDev2-ReproducingArmanMAE.ipynb. Conclusions:

cczhu commented 4 years ago

While productionizing the fitter, I noticed that the way we estimate the MADT-to-AADT ratio for short-term counts is:
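Reconstructing the estimate from the ratios discussed below (with DoM_dij = MADT_j / DoMADT_dij and D_dij = AADT / DoMADT_dij taken from the matched PTC), it is roughly:

```latex
\widehat{\mathrm{MADT}}_j = \mathrm{Count}_{dij}\,\mathrm{DoM}_{dij}, \qquad
\widehat{\mathrm{AADT}} = \mathrm{Count}_{dij}\,D_{dij}, \qquad
\frac{\widehat{\mathrm{MADT}}_j}{\widehat{\mathrm{AADT}}} = \frac{\mathrm{DoM}_{dij}}{D_{dij}}
```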

If we imagine a situation where we have an STTC with only one month j of data, and an associated PTC with only one year's worth, then DoM_dij / D_dij (which, recall, are estimated from the PTC) is just the PTC MADT / AADT ratio for month j. Thus, when we calculate MADT / AADT for the STTC, it will exactly equal the PTC ratio, giving zero MSE (the full derivation is in CountmatchDev3-SensibleMatcherPrototype.ipynb).

This breaks the minimum-MSE matching, since it's not unrealistic for several nearby PTCs to have only one year's worth of data. The algorithm does fail somewhat gracefully, though: it will pick the closest PTC with zero error, rather than some random PTC. Also note that this is not a bug, but a limitation of how the estimation process works - it occurs because there's far too little data for comparing normalized monthly patterns.

Two ways to resolve this:

cczhu commented 4 years ago

We now have functionalized versions of all the CountMatch algorithms! I created:

The last three algorithms all allow overriding the PTC growth factor (as in Bagheri) with the citywide PTC average growth factor (as in TEPs). Doing this reduces the predictive accuracy of the model under ideal circumstances but should improve accuracy when PTC data is too sparse for proper growth rates to be calculated (which I suspect is why Arman averages the PTC growth rates in TEPs).

To validate, I'm generating fake STTC data from the PTC stations, as in Bagheri et al. For each PTC station, I draw a random STTC station, and use the months and years for which it has data to select a small subset of the PTC data. This allows me to reproduce the annual and sub-annual patterns of STTC counts when generating fake data. Bagheri suggests generating 100+ iterations, but I'm impatient and only generated 10.
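A minimal sketch of this resampling scheme (hypothetical data structures; the actual generator lives in the dev notebooks):

```python
import numpy as np

def make_fake_sttcs(ptc_counts, sttc_schedules, rng=None):
    """Subsample PTC daily counts to mimic STTC observation patterns.

    ptc_counts : dict of PTC ID -> (dict of (year, month) -> daily counts).
    sttc_schedules : list of sets of (year, month) pairs, one per real
        STTC, recording when that station actually has data.
    """
    if rng is None:
        rng = np.random.default_rng()
    fake = {}
    for ptc_id, counts in ptc_counts.items():
        # Draw a random real STTC and copy its observation schedule.
        schedule = sttc_schedules[rng.integers(len(sttc_schedules))]
        # Keep only the PTC data falling within that schedule.
        fake[ptc_id] = {ym: c for ym, c in counts.items() if ym in schedule}
    return fake
```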

Full results of the algorithm shootout are in CountmatchDev6-Shootout. Summary:

Sensitivity Test

This test checks how much estimates differ between different draws of fake data. An excellent predictive algorithm should minimize this variation.

We can measure this variation by calculating the standard deviation divided by the mean, i.e. the coefficient of variation (COV), for each short-term count location. We'll take 2018 as a representative year.
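In pandas terms (hypothetical column names), the per-station COV is something like:

```python
import pandas as pd

# Hypothetical layout: one 2018 AADT estimate per (draw, station).
preds = pd.DataFrame({
    'draw': [0, 1, 0, 1],
    'station': [1, 1, 2, 2],
    'aadt_2018': [10100., 9900., 5200., 4800.],
})

grouped = preds.groupby('station')['aadt_2018']
cov = grouped.std() / grouped.mean()  # coefficient of variation per station
print(cov.median())                   # the summary statistic quoted below
```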

[image: Shootout_COV]

The TEPs method has the lowest COV, as measured by the median, and the CountMatch default method the highest. This is largely due to using individual station growth rates - if we force the use of the global growth rate, we can reduce the variation to near-TEPs levels.

Check Against Ground Truth

We need to examine how accurate the models are, but we can only do that when predicting for sites where we know the AADT, and which years it is known varies from station to station. I therefore rigged up a version of the fake data generator that cycles through the years, predicting AADT for each station and year where a ground truth value is known.

This is computationally intensive, so only 10 sets of fake data are generated per experiment.

Here are the results for CountMatch using the TEPs algorithm:

[image: Shootout_PredvsGrd_TEPs]

for CountMatch:

[image: Shootout_PredvsGrd_CM]

and for CountMatch with the global growth rate:

[image: Shootout_PredvsGrd_CMGGF]

All plots share a common scale, which crops out the extreme outliers produced by CountMatch without a global growth rate.

Observations:

For a more thorough analysis, see the ipynb.

cczhu commented 4 years ago

To do:

cczhu commented 4 years ago

Currently testing the newest CountMatch PR; making predictions for one year using the entire FLOW database takes around 15 minutes on a Core i5 workstation. That's pretty good!