dgrtwo / fuzzyjoin

Join tables together on inexact matching
Other
668 stars 61 forks source link

use kd-trees for distance and geo_joins #54

Open dan-reznik opened 5 years ago

dan-reznik commented 5 years ago

wondering if to distance_join() or geo_join() two tables, one with M coords and the other with N coords, say M<=N, your curr implementation is "exaustive" i.e., it runs in O(M*N) time. if so, consider creating a kd-tree (of kNN fame) of the smaller list in O[M*log(M)] time and querying it N times for the closest point to P1,P2,...PN on the longer table. only join if the dist(P, closest(P)) < max_dist. this will run in O[(M+N)*log(M)] which will be at least M/(2*log(M)) faster than the exaustive method.