UDST / choicemodels

Python library for discrete choice modeling
https://udst.github.io/choicemodels
BSD 3-Clause "New" or "Revised" License
74 stars 33 forks source link

calculate distance matrices to sample weighted by distance bands #6

Closed gboeing closed 7 years ago

gboeing commented 7 years ago

We do not need to merge this PR yet... I'm just opening it now for a place to discuss/comment.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling 775308970d678ae12c415e8257a5bc53f5546785 on gboeing into on master.

gboeing commented 7 years ago

@smmaurer I'll benchmark number 1 later this week. I wrote the great circle routine myself, and it's all vectors, so it's fast. I'm using a scipy routine for the euclidean distance matrix, and I'm not sure how they implemented it. If their routine isn't particularly fast, it'll be easy enough to re-write it myself fully vectorized with numpy.

Regarding number 2, I liked the hierarchical index because it seems more pandas-like, makes for fast indexing, and matches the (presumed) structure we'd want to use in a relational DB (i.e., columns of origin_key, destination_key and distance -- with indices for the two keys). We can inspect the memory requirements of each format as part of the decision of how to represent the data. That said, it's trivial to unstack the dataframe into an i x j matrix instead of a stacked, multi-indexed vector.

I'll check out the above and lastly will add some more tests of the distance matrix module before doing a merge (probably later this week).

smmaurer commented 7 years ago

@gboeing Ok, great! I think we're on the same page about number 2 -- I agree that a hierarchical index seems better. The distance matrix demo notebook currently generates an i x j CSV, but we can change it to [i * j] x 3, with columns for the origin key, destination key, and distance.

Thanks again for putting together these distance calculations! Later this week I'll take a look at integrating them into the merging/sampling code..

gboeing commented 7 years ago

@smmaurer regarding benchmarking euclidean vs great circle calculations... the wall time in the notebook was actually 34 ms (euclidean) vs 1.47 s (great circle). So great circle takes about 43x longer to run. So, the scipy routine seems sufficiently fast and I'll leave that as is.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling 9bf64520ae5921c16d403bab00e9f05ce2c30ee0 on gboeing into on master.

smmaurer commented 7 years ago

@gboeing Oh, that makes more sense. Sorry for misreading it!

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling efe9dfe9d85ff13e2c3cf68edf3000b03d1fac50 on gboeing into on master.

coveralls commented 7 years ago

Coverage Status

Changes Unknown when pulling d1552ccdd10bf0965724104aa8936e722c590a5d on gboeing into on master.