KDTree-based nearest neighbour search for permanent counts

Both PRTCS and KCOUNT require distance information. PRTCS uses what is effectively an angular distance (distances calculated on raw lat-lon values, without projecting them to a planar coordinate system) to determine the N nearest neighbours, while KCOUNT uses the distance values directly.
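To illustrate why unprojected lat-lon distances are not "sensible", here is a minimal sketch comparing a raw-degree distance with a great-circle (haversine) distance. The coordinates are illustrative points roughly in downtown Toronto, and the function names are made up for this example; none of this is taken from the PRTCS code.

```python
import math


def degree_distance(lat1, lon1, lat2, lon2):
    """Euclidean distance on raw lat-lon degrees (no projection)."""
    return math.hypot(lat2 - lat1, lon2 - lon1)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Illustrative points (assumed, not from the count data).
origin = (43.65, -79.38)
north = (43.66, -79.38)  # 0.01 degrees of latitude away
east = (43.65, -79.37)   # 0.01 degrees of longitude away

deg_north = degree_distance(*origin, *north)
deg_east = degree_distance(*origin, *east)
km_north = haversine_km(*origin, *north)
km_east = haversine_km(*origin, *east)

# Same distance in degrees, but the east-west step is noticeably
# shorter on the ground at this latitude.
print(deg_north, deg_east)
print(km_north, km_east)
```

At this latitude a degree of longitude covers less ground than a degree of latitude, so a nearest-neighbour ranking on raw degrees can differ from one based on projected coordinates.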

There are ~150 PTC locations and ~9000 STTC locations across all directions and all years with data, and ~16,000 major arterial, minor arterial or collector road centreline segments. Storing the full pairwise distance matrix for all arterial centreline segments would require a table with ~260 million rows or a 2D NumPy matrix of ~960 MB. Both are feasible to store but impractical to parse. TEPs-I avoids this problem in KCOUNT by only considering neighbouring links within 2 km of any given centreline segment (line 110 of data_prep_kriging.m, b=b0((find(b0(:,15 )<2)),:);), and in PRTCS by using a nearest neighbour search in place of a distance matrix.
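As a rough check on those numbers, and to show what a 2 km cutoff buys us, here is a sketch using SciPy's `cKDTree`. The segment coordinates are random points in a hypothetical 40 km x 30 km box (projected, in metres), and the 4-byte entry size is an assumption; this is not a port of the TEPs-I code.

```python
import numpy as np
from scipy.spatial import cKDTree

n_segments = 16_000  # approximate segment count quoted above

# Size of a dense pairwise matrix, assuming 4-byte (float32) entries.
n_pairs = n_segments ** 2
dense_bytes = n_pairs * np.dtype(np.float32).itemsize
print(f"{n_pairs:,} entries, {dense_bytes / 2**20:.0f} MiB as float32")

# Hypothetical projected segment midpoints, in metres.
rng = np.random.default_rng(0)
xy = rng.uniform([0.0, 0.0], [40_000.0, 30_000.0], size=(n_segments, 2))

# Keep only the pairs of segments within 2 km of each other.
tree = cKDTree(xy)
pairs = tree.query_pairs(r=2_000.0, output_type="ndarray")
pair_dist = np.linalg.norm(xy[pairs[:, 0]] - xy[pairs[:, 1]], axis=1)

print(f"{len(pairs):,} pairs within 2 km")
```

With the exact count of 16,000 the dense matrix comes out slightly above the ~960 MB quoted above, which is expected since the segment count is approximate. The number of pairs kept by the radius query depends entirely on how densely the real segments are packed, so the figure printed here is only indicative.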

We should adopt a similar strategy for CountMatch: calculate the N nearest neighbours using either a tree (for Euclidean or Manhattan distances) or routing (for network distances). The former can be done quickly and locally in Python using a KDTree. For the latter, only the N nearest neighbours (and potentially their distances) should be stored, but this means the database would have to recalculate neighbours every time short-term counts from a new location are introduced. Possible solutions:

- In the short term, we'll continue using Euclidean or Manhattan distances between count locations, since that's what PRTCS does.
- In the medium term, the code should raise an error prompting the user to recalculate the network distance data if new count locations have been introduced or if N changes.
- In the long term, we should put some thought into having the model auto-refresh its source data if missing values are encountered.
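The tree-based search and the medium-term staleness check described above could look something like the sketch below. It assumes count locations are already in projected coordinates (metres); `nearest_ptcs`, `cache_is_stale`, `require_fresh_cache` and `StaleNeighbourError` are hypothetical names for illustration, not existing CountMatch functions, and the coordinates are random.

```python
import numpy as np
from scipy.spatial import cKDTree


def nearest_ptcs(ptc_xy, sttc_xy, n_neighbours=5, manhattan=False):
    """Return distances to, and indices of, the N nearest PTCs for each STTC.

    Both outputs have shape (n_sttc, n_neighbours) when n_neighbours > 1.
    """
    tree = cKDTree(ptc_xy)
    p = 1 if manhattan else 2  # Minkowski p: 1 = Manhattan, 2 = Euclidean
    return tree.query(sttc_xy, k=n_neighbours, p=p)


class StaleNeighbourError(RuntimeError):
    """Stored neighbour data no longer matches the current inputs."""


def cache_is_stale(cached_ids, cached_n, current_ids, n_neighbours):
    """True if new count locations appeared or N has changed."""
    new_ids = set(current_ids) - set(cached_ids)
    return bool(new_ids) or cached_n != n_neighbours


def require_fresh_cache(cached_ids, cached_n, current_ids, n_neighbours):
    """Raise an error asking the user to recalculate neighbour data."""
    if cache_is_stale(cached_ids, cached_n, current_ids, n_neighbours):
        raise StaleNeighbourError(
            "Stored neighbour data is out of date; please recalculate.")


# Hypothetical projected coordinates, sized like the counts quoted above.
rng = np.random.default_rng(1)
ptc_xy = rng.uniform(0.0, 40_000.0, size=(150, 2))
sttc_xy = rng.uniform(0.0, 40_000.0, size=(9_000, 2))

dist, idx = nearest_ptcs(ptc_xy, sttc_xy, n_neighbours=5)
print(dist.shape, idx.shape)
```

Building the tree on the ~150 PTC locations and querying it with the STTC locations avoids ever forming a full distance matrix. For network distances the same staleness check would apply to whatever table stores the routed neighbours.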
