Avoid more duplicate spots

tillwenke commented 1 month ago

Suggest reviewing an existing spot if a new spot is added within e.g. 100 m of an existing one.

Maybe we need a clever data structure to quickly get all spots close to a new spot.

bopjesvla commented 1 month ago

> Maybe we need a clever data structure to quickly get all spots close to a new spot.

classic cs student. maybe once we have 1 million spots.

we could create a dedicated review page that quickly jumps from one possible duplicate spot to another

tillwenke commented 1 month ago

I think a dedicated page does not fit my requirements. I was thinking of a user adding a spot: if they try to add it close to an existing spot, we intervene and ask whether they would like to review the existing nearby spot instead.
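A minimal sketch of that check, assuming existing spots are available as (spot_id, lat, lon) tuples. `nearby_spots`, the tuple layout, and the 100 m default are illustrative and not taken from the hitch codebase. It is a plain linear scan using the haversine formula, so no special data structure is involved.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby_spots(new_lat, new_lon, spots, radius_m=100):
    """Return (distance_m, spot_id) for existing spots within radius_m, nearest first.

    `spots` is a hypothetical iterable of (spot_id, lat, lon) tuples.
    """
    hits = []
    for spot_id, lat, lon in spots:
        d = haversine_m(new_lat, new_lon, lat, lon)
        if d <= radius_m:
            hits.append((d, spot_id))
    return sorted(hits)


# Made-up sample data: two spots a few meters apart and one about 1 km away.
existing = [(1, 52.5200, 13.4050), (2, 52.5205, 13.4050), (3, 52.5300, 13.4050)]
print(nearby_spots(52.5201, 13.4050, existing))
```

If the result is non-empty, the first entry would be the spot to suggest for review.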

tillwenke commented 1 month ago

Somehow some users are not aware that reviewing instead of adding a spot is an option.

bopjesvla commented 1 month ago

I think I already added this before, but it's tough to explain without a lot of text. This should make deduplication unnecessary though: https://github.com/bopjesvla/hitch/issues/46

bopjesvla commented 1 month ago

You had already clustered the points, correct? I think we can use that clustering to merge points on the front-end. If you have a script that outputs (lat, lon, cluster_id) for every point, that should be easy
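For illustration, a script of that shape could look like the sketch below. It is not the clustering from hitchmap-data; it assumes simple single-linkage grouping (points within `max_dist_m` of each other end up in one cluster), and `cluster_points` and the 50 m default are made up for the example. The pairwise loop is quadratic, which is only acceptable for modest numbers of points.

```python
import math


def distance_m(a, b):
    """Approximate distance in meters between two (lat, lon) pairs (equirectangular)."""
    lat1, lon1 = a
    lat2, lon2 = b
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6_371_000 * math.hypot(dx, dy)


def cluster_points(points, max_dist_m=50):
    """Return (lat, lon, cluster_id) rows; points linked by hops of <= max_dist_m share an id."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the sets of every pair of points that are close enough.
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if distance_m(points[i], points[j]) <= max_dist_m:
                parent[find(j)] = find(i)

    # Relabel set roots as consecutive ids in order of first appearance.
    ids = {}
    rows = []
    for i, (lat, lon) in enumerate(points):
        cluster_id = ids.setdefault(find(i), len(ids))
        rows.append((lat, lon, cluster_id))
    return rows


# Made-up sample: two points about 13 m apart, one far away.
sample = [(52.5200, 13.4050), (52.5201, 13.4051), (48.1372, 11.5756)]
for row in cluster_points(sample):
    print(row)
```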

tillwenke commented 1 month ago

I came from reporting some duplicates (as one might see here https://hitchmap.com/dashboard.html), saw a lot of recent duplicates as well, and feared that new ones will keep spawning while we are still cleaning up the old ones.

Tough to solve with #46, as e.g. spots on opposite sides of a road can be quite close.

I'd like to avoid text as well. How about at the end of the process:

(if there is a nearby spot) "We will add your review to this spot. Are you OK with it?"
(if there is another nearby spot) if not: "Do you want to select another spot to add your review to?" -> select
if not: OK, we'll keep your new spot

Could live without the 2nd option.
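The three steps above can be sketched as a small decision function. This is only an outline of the flow; `next_prompt`, its flags, and the action names are hypothetical, and the input is assumed to be a list of nearby spot ids sorted nearest first.

```python
def next_prompt(nearby_spot_ids, declined_first=False, declined_others=False):
    """Decide what to ask after the user has placed a new spot.

    nearby_spot_ids: ids of existing spots near the new one, nearest first.
    declined_first: the user refused to add the review to the nearest spot.
    declined_others: the user also refused to pick another nearby spot.
    Returns an (action, payload) tuple.
    """
    if not nearby_spot_ids:
        # Nothing nearby: the new spot is simply created.
        return ("keep_new_spot", None)
    if not declined_first:
        # Step 1: offer to attach the review to the nearest existing spot.
        return ("confirm_merge", nearby_spot_ids[0])
    if len(nearby_spot_ids) > 1 and not declined_others:
        # Step 2 (optional): let the user pick one of the other nearby spots.
        return ("select_other", nearby_spot_ids[1:])
    # Step 3: the user declined everything, so the new spot is kept.
    return ("keep_new_spot", None)


print(next_prompt([7, 9]))
print(next_prompt([7, 9], declined_first=True))
print(next_prompt([7, 9], declined_first=True, declined_others=True))
```

Dropping the 2nd option would amount to removing the `select_other` branch.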

bopjesvla commented 1 month ago

> tough to solve with https://github.com/bopjesvla/hitch/issues/46 as e.g. spots on opposite sides of a road can be quite close.

I encountered this too, but it can be solved. I asked ChatGPT for a solution a few days ago, this is what it came up with:

import requests
import pandas as pd
from scipy.spatial import KDTree
from shapely.geometry import Point, LineString
from shapely.ops import nearest_points

# Function to query the OSM Overpass API for the nearest road and its geometry
def get_nearest_road_geometry(lat, lon):
    # Ask for highway ways within 50 m. "out geom" attaches each way's
    # coordinates directly, so no recursion down to nodes is needed.
    overpass_url = "https://overpass-api.de/api/interpreter"
    overpass_query = f"""
    [out:json];
    way(around:50,{lat},{lon})["highway"];
    out geom;
    """
    response = requests.get(overpass_url, params={'data': overpass_query}, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Build a LineString per way and keep the one closest to the query point
    # (distance in degrees, which is good enough for ranking at this scale)
    point = Point(lon, lat)
    best_id, best_geom = None, None
    for element in data.get('elements', []):
        if element.get('type') != 'way' or 'geometry' not in element:
            continue
        geom = LineString([(pt['lon'], pt['lat']) for pt in element['geometry']])
        if best_geom is None or geom.distance(point) < best_geom.distance(point):
            best_id, best_geom = element['id'], geom
    return best_id, best_geom

# Function to check if two points are on the same side of the road
def are_points_on_same_side(point1, point2, road_geom):
    # Shapely distances are never negative, so their sign carries no side
    # information. Instead, connect the two points with a straight segment
    # and check whether that segment crosses the road.
    connecting_segment = LineString([(point1.x, point1.y), (point2.x, point2.y)])

    # If the segment does not touch the road, both points are on the same side
    return not connecting_segment.intersects(road_geom)

# Function to query OSM for service areas
def get_service_area(lat, lon):
    overpass_url = "https://overpass-api.de/api/interpreter"
    # Match either tag family; requiring an amenity and a highway tag on the
    # same element would match almost nothing. The tag values are an assumption
    # based on common OSM tagging (highway=services, highway=rest_area).
    overpass_query = f"""
    [out:json];
    (nwr(around:50,{lat},{lon})["amenity"~"^(parking|fuel)$"];
    nwr(around:50,{lat},{lon})["highway"~"^(services|rest_area)$"];
    );
    out ids;
    """
    response = requests.get(overpass_url, params={'data': overpass_query}, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Extract the service area ID (or other identifying information)
    elements = data.get('elements', [])
    if elements:
        # Return the ID of the first matching element
        return elements[0]['id']
    return None

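# Sample DataFrame with coordinates
df = pd.DataFrame({
    'x': [52.5200, 52.5201, 52.5202],  # latitudes
    'y': [13.4050, 13.4051, 13.4052]   # longitudes
})

# KDTree for efficient neighbor search. The tree measures plain Euclidean
# distance, so convert degrees to 3D Cartesian coordinates in meters first.
# For thresholds of tens of meters, the straight-line (chord) distance is
# practically identical to the distance along the surface.
import numpy as np  # belongs with the imports at the top of the script
earth_radius_m = 6_371_000
lat_rad = np.radians(df['x'].to_numpy())
lon_rad = np.radians(df['y'].to_numpy())
coords = earth_radius_m * np.column_stack([
    np.cos(lat_rad) * np.cos(lon_rad),
    np.cos(lat_rad) * np.sin(lon_rad),
    np.sin(lat_rad),
])
tree = KDTree(coords)

# Define distance threshold
distance_threshold = 50  # 50 meters

# Find nearby points
neighbors = tree.query_ball_point(coords, distance_threshold)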

# Initialize lists to store road IDs, geometries, and service areas
df['road_id'] = None
df['road_geom'] = None
df['service_area'] = None

# Query road segment and geometry for each point
for idx, row in df.iterrows():
    lat, lon = row['x'], row['y']

    # Query nearest road
    road_id, road_geom = get_nearest_road_geometry(lat, lon)
    df.at[idx, 'road_id'] = road_id
    df.at[idx, 'road_geom'] = road_geom

    # Query service area
    service_area_id = get_service_area(lat, lon)
    df.at[idx, 'service_area'] = service_area_id

# Check each pair of nearby points once (j > i skips self-pairs and mirrored duplicates)
same_side_or_service_pairs = []
for i, nearby in enumerate(neighbors):
    for j in nearby:
        if j <= i:
            continue
        road_id_i = df.loc[i, 'road_id']
        road_id_j = df.loc[j, 'road_id']
        service_area_i = df.loc[i, 'service_area']
        service_area_j = df.loc[j, 'service_area']

        # Check if they are on the same road and the same side.
        # A missing road ID (None) must not count as "the same road".
        if pd.notna(road_id_i) and road_id_i == road_id_j:
            point1 = Point(df.loc[i, 'y'], df.loc[i, 'x'])  # (lon, lat)
            point2 = Point(df.loc[j, 'y'], df.loc[j, 'x'])  # (lon, lat)
            road_geom = df.loc[i, 'road_geom']

            if road_geom is not None and are_points_on_same_side(point1, point2, road_geom):
                same_side_or_service_pairs.append((i, j))

        # Otherwise, check if both points are in the same service area
        elif pd.notna(service_area_i) and service_area_i == service_area_j:
            same_side_or_service_pairs.append((i, j))

print("Pairs of nearby points on the same road side or service area:", same_side_or_service_pairs)

Dunno if it works, but something like it probably will work. Even if we make the occasional mistake, as long as clicking the spot shows the data where it was reported, all's good.

bopjesvla commented 1 month ago

Note: to check if two points are on the same side I'd probably draw a line between the points and see if it intersects with the road, don't know if signed distances really exist

bopjesvla commented 1 month ago

Yeah signed distances definitely appear to be a hallucination, other than that I think it's very close
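For what it's worth, Shapely's `distance` is always non-negative, but a sign can be computed by hand for a single straight road segment using the 2D cross product. A rough sketch, assuming planar (projected) coordinates; `side_of_segment` and `same_side` are made-up helper names. For a road with many bends you would first have to pick the nearest segment, so the intersection check is probably still simpler.

```python
def side_of_segment(p, a, b):
    """Sign of the 2D cross product (b - a) x (p - a).

    Returns 1 if p is left of the directed segment a -> b, -1 if it is
    right of it, and 0 if p lies on the line through a and b.
    Points are (x, y) tuples in a planar coordinate system.
    """
    cross = (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    return (cross > 0) - (cross < 0)


def same_side(p1, p2, a, b):
    """True if p1 and p2 lie strictly on the same side of the line through a and b."""
    s1 = side_of_segment(p1, a, b)
    s2 = side_of_segment(p2, a, b)
    return s1 != 0 and s1 == s2


# A road running east along the x axis, with spots north and south of it.
road_start, road_end = (0.0, 0.0), (10.0, 0.0)
print(same_side((2.0, 1.0), (8.0, 3.0), road_start, road_end))
print(same_side((2.0, 1.0), (8.0, -3.0), road_start, road_end))
```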

tillwenke commented 1 month ago

I did similar things around here https://github.com/Hitchwiki/hitchmap-data/tree/main/cleaning

In addition, we should come up with a way to educate users so they do not further pollute the map.

bopjesvla commented 1 month ago

It's not polluting if we can handle it :)

I'm all for people logging exactly where they stood as long as it doesn't mess up the map

bopjesvla / hitch

Avoid more duplicate spots #93