NickCrews / mismo

The SQL/Ibis powered sklearn of record linkage
https://nickcrews.github.io/mismo/
GNU Lesser General Public License v3.0

Incorrect distance constraint using CoordinateBlocker #74

Closed: jstammers closed 1 month ago

jstammers commented 1 month ago

I have a dataset of ~1M records containing lat/lon coordinates that I would like to block using a CoordinateBlocker, but I'm running into memory issues when doing so.

As an example, I've simulated some data using a grid of centroids, sampling from a 2D normal distribution with a standard deviation chosen so that the overlap between clusters is fairly small:

import numpy as np
import pandas as pd

np.random.seed(0)
# Parameters
n_clusters = 10000
n_in_cluster = 100
dist_between_clusters = 1000  # Distance between clusters in meters (e.g., 1 km)
lat0 = 0.0  # Starting latitude in degrees
lon0 = 0.0  # Starting longitude in degrees

# Earth's radius and conversions
earth_radius = 6371000  # in meters

# Function to calculate meters per degree latitude
def meters_per_degree_lat():
    return 111132.92 - 559.82 * np.cos(2 * np.radians(lat0)) + \
           1.175 * np.cos(4 * np.radians(lat0)) - 0.0023 * np.cos(6 * np.radians(lat0))

# Function to calculate meters per degree longitude
def meters_per_degree_lon(lat):
    return (np.pi * earth_radius * np.cos(np.radians(lat))) / 180

# Calculate meters per degree at starting latitude
meters_per_deg_lat = meters_per_degree_lat()
meters_per_deg_lon_initial = meters_per_degree_lon(lat0)

# Calculate grid dimensions
dim = int(np.ceil(np.sqrt(n_clusters)))
n_clusters_actual = dim * dim  # Actual number of clusters

# Calculate degree differences between clusters
delta_lat = dist_between_clusters / meters_per_deg_lat
delta_lon = dist_between_clusters / meters_per_deg_lon_initial

# Standard deviations for clusters in degrees
sigma_lat = (dist_between_clusters / 2) / meters_per_deg_lat
sigma_lon_initial = (dist_between_clusters / 2) / meters_per_deg_lon_initial

# Initialize list to hold points
points = []

# Generate grid of centroids and points around them
for i in range(dim):
    for j in range(dim):
        # Compute centroid latitude and longitude
        lat = lat0 + i * delta_lat
        lon = lon0 + j * delta_lon
        # Update meters per degree longitude and sigma_lon for current latitude
        meters_per_deg_lon_current = meters_per_degree_lon(lat)
        sigma_lon = (dist_between_clusters / 2) / meters_per_deg_lon_current
        # Generate points around centroid
        cluster_lats = np.random.normal(loc=lat, scale=sigma_lat, size=n_in_cluster)
        cluster_lons = np.random.normal(loc=lon, scale=sigma_lon, size=n_in_cluster)
        # Combine latitudes and longitudes
        cluster_points = np.column_stack((cluster_lats, cluster_lons))
        points.extend(cluster_points)

# Convert points to a DataFrame
df = pd.DataFrame(points, columns=['latitude', 'longitude'])
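
As a quick sanity check (an addition for reference, using the variables defined above), the grid is 100 × 100 clusters of 100 points each, i.e. one million rows:

# Sanity check: dim * dim = 100 * 100 clusters, 100 points per cluster
assert len(df) == n_clusters_actual * n_in_cluster == 1_000_000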

I can use the sklearn.neighbors.BallTree class to calculate the number of points within a given radius of each point as follows:

import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

# Convert coordinates to radians
df['lat_rad'] = np.radians(df['latitude'])
df['lon_rad'] = np.radians(df['longitude'])
coordinates = df[['lat_rad', 'lon_rad']].to_numpy()

# Build the BallTree
tree = BallTree(coordinates, metric='haversine')

def count_in_distance(x):
    # x is a radius in meters; haversine distances are in radians,
    # so divide by the Earth's radius (earth_radius, defined above)
    radius_in_radians = x / earth_radius
    indices = tree.query_radius(coordinates, r=radius_in_radians)
    # Subtract 1 to exclude each point itself
    return [len(ind) - 1 for ind in indices]

df['count_within_1km'] = count_in_distance(1000)

In this case, this gives around 300 points within 1 km of each point on average:

[image: distribution of count_within_1km]

On my machine, it takes around 40s to calculate this. However, when I try to block these records using CoordinateBlocker, I run out of memory.

import ibis
from mismo.lib.geo import CoordinateBlocker
con = ibis.get_backend()
t = con.create_table("clusters", df, overwrite=True)
coord_blocker = CoordinateBlocker(lat="latitude", lon="longitude", distance_km=1)
blocked = coord_blocker(t, t).cache() # runs OOM

In this example, I would expect around 300M pairs (~1M records × ~300 neighbors each), but blocked.count().execute() returns around 29 billion records.

Restricting to just the first 10k records, I can see that some pairs have a much larger distance than expected, which may be related to how the coordinates are used to block records together:

from mismo.lib.geo import distance_km
coord_blocker = CoordinateBlocker(lat="latitude", lon="longitude", distance_km=1)
blocked = coord_blocker(t.limit(10000), t.limit(10000))
blocked = blocked.mutate(
    distance=distance_km(
        lat1=blocked.latitude_l,
        lon1=blocked.longitude_l,
        lat2=blocked.latitude_r,
        lon2=blocked.longitude_r,
    )
)
blocked

┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ latitude_l ┃ latitude_r ┃ longitude_l ┃ longitude_r ┃ distance  ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64       │ int64       │ float64    │ float64    │ float64     │ float64     │ float64   │
├─────────────┼─────────────┼────────────┼────────────┼─────────────┼─────────────┼───────────┤
│           0 │        9997 │   0.007977 │   0.005547 │    0.008468 │    0.890071 │ 98.030167 │
│           1 │          95 │   0.001809 │   0.003195 │   -0.006060 │   -0.000771 │  0.607952 │
│           2 │          95 │   0.004426 │   0.003195 │   -0.005713 │   -0.000771 │  0.566254 │
│           3 │        9997 │   0.010133 │   0.005547 │    0.004359 │    0.890071 │ 98.487991 │
│           4 │          95 │   0.008445 │   0.003195 │   -0.005275 │   -0.000771 │  0.769126 │
│           5 │        9999 │  -0.004419 │  -0.005760 │    0.008740 │    0.888860 │ 97.865038 │
│           6 │          95 │   0.004296 │   0.003195 │   -0.001860 │   -0.000771 │  0.172167 │
│           7 │         182 │  -0.000684 │  -0.003129 │   -0.003361 │   -0.001149 │  0.366605 │
│           8 │        9999 │  -0.000467 │  -0.005760 │    0.008647 │    0.888860 │ 97.877034 │
│           9 │        9997 │   0.001857 │   0.005547 │    0.006657 │    0.890071 │ 98.231970 │
│           … │           … │          … │          … │           … │           … │         … │
└─────────────┴─────────────┴────────────┴────────────┴─────────────┴─────────────┴───────────┘
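
As a stopgap, the over-generated candidates can be filtered down using the exact distance column computed above, though at full scale this still materializes the huge intermediate result:

# Keep only candidate pairs whose true haversine distance is within 1 km
within_1km = blocked.filter(blocked.distance <= 1.0)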
jstammers commented 1 month ago

I've tried varying the distance_km value to see if there's any pattern. It doesn't scale linearly as I would expect, so I wonder if this is related to how the coordinates are being hashed to approximate the distance between two locations.

[image: blocked pair counts for varying distance_km]
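
For context, grid-based coordinate blocking generally snaps each coordinate to an integer cell index and joins on matching (or adjacent) cells, so a bug in the cell-size arithmetic would make every cell far larger than intended and blow up the pair count. A minimal sketch of the general idea (an illustration only, not mismo's actual implementation):

import ibis

def grid_cell(coord, cell_size_deg):
    # Snap a coordinate (in degrees) to an integer grid-cell index;
    # points within ~cell_size_deg land in the same or an adjacent cell,
    # so joining on the index yields a superset of the truly-nearby pairs.
    return coord // cell_size_deg

lat = ibis.literal(0.007977)
print(ibis.to_sql(grid_cell(lat, 1 / 111.0)))  # ~1 km of latitude per cell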

NickCrews commented 1 month ago

Thanks @jstammers, almost definitely a bug. I'll take a look.

NickCrews commented 1 month ago

It looks like a bug in how ibis compiles floor division: it doesn't preserve the needed parentheses:

import ibis

reg = ibis.literal(10) / (1 / ibis.literal(2))
floor = ibis.literal(10) // (1 / ibis.literal(2))

print(ibis.to_sql(reg))
# SELECT
#   10 / (
#     1 / 2
#   ) AS "Divide(10, Divide(1, 2))"

print(ibis.to_sql(floor))
# SELECT
#   CAST(FLOOR(10 / 1 / 2) AS BIGINT) AS "FloorDivide(10, Divide(1, 2))"
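
The lost parentheses change the arithmetic. In plain Python, mirroring what the two SQL expressions compute:

# Correct: floor(10 / (1 / 2)) = floor(20.0) = 20
assert 10 / (1 / 2) == 20.0
# What the emitted SQL computes: floor(10 / 1 / 2) = floor(5.0) = 5
assert 10 / 1 / 2 == 5.0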

Will link here to the ibis issue/PR that I will make.

NickCrews commented 1 month ago

PR up at https://github.com/ibis-project/ibis/pull/10353

NickCrews commented 1 month ago

Until that is fixed upstream, I've added a workaround on our side in https://github.com/NickCrews/mismo/commit/4598a9e8a45759617e58ab45b1332436da3446ae. So this should be fixed now, please let me know if not!
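
A quick way to verify after updating mismo (a sketch, reusing t and coord_blocker from the example above):

# With the workaround, the pair count should be on the order of the
# BallTree estimate (~300M), not ~29 billion.
blocked = coord_blocker(t, t)
print(blocked.count().execute())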

jstammers commented 1 month ago

Thanks for fixing this! For reference, this is what I have now, which is much more manageable:

[image: blocked pair counts after the fix]