Closed jstammers closed 1 month ago
I've tried varying the distance_km value to see if there's any pattern there. It doesn't scale linearly as I would expect, so I wonder if this is related to how the coordinates are being hashed to approximate the distance between two locations.
Thanks @jstammers, almost definitely a bug. I'll take a look.
It looks like a bug in how ibis compiles the floor divide. It doesn't preserve the needed parentheses:
import ibis
reg = ibis.literal(10) / (1 / ibis.literal(2))
floor = ibis.literal(10) // (1 / ibis.literal(2))
print(ibis.to_sql(reg))
SELECT
10 / (
1 / 2
) AS "Divide(10, Divide(1, 2))"
print(ibis.to_sql(floor))
SELECT
CAST(FLOOR(10 / 1 / 2) AS BIGINT) AS "FloorDivide(10, Divide(1, 2))"
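To see why the dropped parentheses matter, here is a minimal sketch in plain Python (not ibis) that evaluates both groupings with the same literals. It assumes `/` is true floating-point division, as in Python 3. The variable names are only illustrative.

```python
import math

# Intended expression: 10 // (1 / 2), with the inner division grouped first.
intended = 10 // (1 / 2)

# What the generated SQL computes: FLOOR(10 / 1 / 2).
# Division is left-associative, so this is FLOOR((10 / 1) / 2).
compiled = math.floor(10 / 1 / 2)

print(intended, compiled)  # 20.0 5
```

The two results differ by a factor of four here, so any quantity derived from the floor divide would be off by a similar margin.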
I'll link the ibis issue/PR here once I've opened it.
Until that is fixed, I've added a workaround on our side in https://github.com/NickCrews/mismo/commit/4598a9e8a45759617e58ab45b1332436da3446ae. So this should be fixed now. Please let me know if not!
Thanks for fixing this! For reference, this is what I have now, which is much more manageable
I have a dataset of ~1M records containing lat/long coordinates that I would like to block using a CoordinateBlocker. I'm finding that I'm running into memory issues when doing this.
As an example, I've simulated some data using a grid of centroids and sampling from a 2D normal distribution, choosing a standard deviation that ensures the overlap between clusters is fairly small.
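The simulation code itself is not shown in the thread. The following is a rough, standard-library sketch of that kind of setup, not the original script. The function name, grid size, spacing, standard deviation, and origin are all hypothetical, and the noise is applied in degrees rather than kilometres for simplicity.

```python
import random


def simulate_clusters(n_side=10, points_per_cluster=100, spacing_deg=0.5,
                      std_deg=0.05, origin=(51.0, -1.0), seed=0):
    """Sample (lat, lon) points around a square grid of centroids.

    All defaults are illustrative. std_deg is kept small relative to
    spacing_deg so that neighbouring clusters barely overlap.
    """
    rng = random.Random(seed)
    points = []
    for i in range(n_side):
        for j in range(n_side):
            # Centroid of the (i, j) grid cell.
            c_lat = origin[0] + i * spacing_deg
            c_lon = origin[1] + j * spacing_deg
            # Isotropic 2D normal noise around the centroid.
            for _ in range(points_per_cluster):
                points.append((rng.gauss(c_lat, std_deg),
                               rng.gauss(c_lon, std_deg)))
    return points


points = simulate_clusters()
print(len(points))  # 10000
```

Scaling `n_side` and `points_per_cluster` up would give a dataset closer to the ~1M records described, at the cost of a slower pure-Python loop.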
I can use the sklearn.neighbors.BallTree class to calculate the number of points within a given radius of each point, which in this case gives around 300 points on average.
On my machine, it takes around 40s to calculate this. However, when I try to block these using CoordinateBlocker, I run out of memory.
In this example, I would expect around 300M pairs, but blocked.count().execute() returns around 29 billion records.
Restricting to just the first 10k records, I can see there are some pairs that have a much larger than expected distance, which may be related to how the coordinates are being used to block records together.
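One way to check for pairs that are too far apart is to compute the great-circle distance for each blocked pair and filter on it. This is a hedged, standard-library sketch. The helper names and the sample coordinates are made up for illustration, and it does not use mismo or the blocked table directly.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def pairs_beyond(pairs, distance_km):
    """Return the pairs whose true separation exceeds distance_km."""
    return [p for p in pairs if haversine_km(*p[0], *p[1]) > distance_km]


# Hypothetical blocked pairs: ((lat_l, lon_l), (lat_r, lon_r)).
pairs = [
    ((51.50, -0.12), (51.51, -0.12)),  # roughly 1 km apart
    ((51.50, -0.12), (52.50, -0.12)),  # roughly 111 km apart
]
too_far = pairs_beyond(pairs, distance_km=10)
print(len(too_far))  # 1
```

Running a check like this on a small sample of blocked pairs gives a direct count of how many exceed the requested distance_km.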