@smmaurer I'll benchmark number 1 later this week. I wrote the great circle routine myself, and it's fully vectorized, so it's fast. I'm using a scipy routine for the euclidean distance matrix, and I'm not sure how they implemented it. If their routine isn't particularly fast, it'll be easy enough to rewrite it myself with vectorized numpy.
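For concreteness, here's a rough sketch of what I mean (illustrative names and sample coordinates, not the actual module code):

```python
import numpy as np
from scipy.spatial.distance import cdist

# hypothetical sample points as (lat, lng) pairs
origins = np.array([[37.77, -122.42], [37.81, -122.27]])
dests = np.array([[37.87, -122.27], [37.34, -121.89]])

def great_circle_matrix(orig, dest, earth_radius=6371009):
    """Vectorized haversine: broadcast origins against destinations, meters out."""
    lat1 = np.radians(orig[:, 0])[:, None]
    lat2 = np.radians(dest[:, 0])[None, :]
    dlng = np.radians(dest[:, 1])[None, :] - np.radians(orig[:, 1])[:, None]
    h = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(h))

gc = great_circle_matrix(origins, dests)        # i x j great circle matrix
eu = cdist(origins, dests, metric='euclidean')  # the scipy euclidean routine
                                                # (only meaningful on projected coords)
```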
Regarding number 2, I liked the hierarchical index because it seems more pandas-like, makes for fast indexing, and matches the (presumed) structure we'd want to use in a relational DB (i.e., columns of origin_key, destination_key and distance -- with indices for the two keys). We can inspect the memory requirements of each format as part of the decision of how to represent the data. That said, it's trivial to unstack the dataframe into an i x j matrix instead of a stacked, multi-indexed vector.
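To illustrate with made-up keys and distances, the stacked representation and the unstack look something like:

```python
import pandas as pd

# hypothetical long-format distances with a hierarchical (origin, destination) index
idx = pd.MultiIndex.from_product([['o1', 'o2'], ['d1', 'd2']],
                                 names=['origin_key', 'destination_key'])
distances = pd.Series([0.0, 4.2, 3.1, 0.0], index=idx, name='distance')

# fast hierarchical lookup: all distances from origin 'o1'
print(distances.loc['o1'])

# trivial to pivot into an i x j matrix when needed
matrix = distances.unstack(level='destination_key')
```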
I'll check out the above and, lastly, add some more tests of the distance matrix module before merging (probably later this week).
@gboeing Ok, great! I think we're on the same page about number 2 -- I agree that a hierarchical index seems better. The distance matrix demo notebook currently generates an i x j CSV, but we can change it to [i * j] x 3, with columns for the origin key, destination key, and distance.
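For example, reshaping the wide output into the long format could look like this (toy data; the notebook's actual variable and file names will differ):

```python
import pandas as pd

# stand-in for the notebook's i x j output: rows are origins, columns are destinations
wide = pd.DataFrame([[0.0, 4.2], [3.1, 0.0]],
                    index=['o1', 'o2'], columns=['d1', 'd2'])

# reshape to [i * j] x 3: origin key, destination key, distance
long_format = (wide.stack()
                   .rename_axis(['origin_key', 'destination_key'])
                   .reset_index(name='distance'))
long_format.to_csv('distance_matrix_long.csv', index=False)
```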
Thanks again for putting together these distance calculations! Later this week I'll take a look at integrating them into the merging/sampling code.
@smmaurer regarding benchmarking euclidean vs. great circle calculations: the wall time in the notebook was actually 34 ms (euclidean) vs. 1.47 s (great circle), so great circle takes about 43x longer to run. Given that, the scipy euclidean routine seems sufficiently fast and I'll leave it as is.
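For anyone who wants to reproduce the comparison, a rough timing harness along these lines would work (random sample points, with the haversine from my earlier sketch repeated for self-containment; this isn't the notebook's actual benchmark code):

```python
import time
import numpy as np
from scipy.spatial.distance import cdist

# hypothetical sample: 2000 random lat-lng points in a ~1-degree box
pts = np.random.uniform([37.0, -123.0], [38.0, -122.0], size=(2000, 2))

def great_circle_matrix(orig, dest, earth_radius=6371009):
    # same vectorized haversine as sketched above
    lat1 = np.radians(orig[:, 0])[:, None]
    lat2 = np.radians(dest[:, 0])[None, :]
    dlng = np.radians(dest[:, 1])[None, :] - np.radians(orig[:, 1])[:, None]
    h = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlng / 2) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(h))

t0 = time.perf_counter()
cdist(pts, pts, metric='euclidean')
t1 = time.perf_counter()
great_circle_matrix(pts, pts)
t2 = time.perf_counter()
print('euclidean: %.3f s, great circle: %.3f s' % (t1 - t0, t2 - t1))
```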
@gboeing Oh, that makes more sense. Sorry for misreading it!
We don't need to merge this PR yet... I'm just opening it now as a place to discuss and comment.