OvertureMaps / data

Overture Maps Data
https://docs.overturemaps.org

Duplicate division features in July release #191

Open skmoore opened 3 months ago

skmoore commented 3 months ago

I'm seeing duplicate division features in the July release. There are a few patterns, some of which may be expected.

The value for local_type is different, so perhaps this is expected? In this example, the capital_of_divisions column is identical for both features, but the values are too long to include here.

| id | subtype | local_type | name |
| --- | --- | --- | --- |
| 085e1b033fffffff0143e1f3681c0468 | locality | suburb | Bratislava |
| 085e1b033fffffff018e7dd51bb7f7c2 | locality | city | Bratislava |

Another example, where local_type and capital_of_divisions have different values for each duplicate:

| id | subtype | local_type | name | capital_of_divisions |
| --- | --- | --- | --- | --- |
| 085cf0a87fffffff01899c602517d124 | locality | city | Kingston | [{division_id=085d2436ffffffff018dac84a99372bd, subtype=country}, {division_id=085cf0acbfffffff0132b9b77708bb1e, subtype=region}] |
| 08516260bfffffff010a9b6d0cbde49e | locality | town | Kingston | [{division_id=08516260bfffffff01d5cb63d73e2437, subtype=county}, {division_id=085a391c7fffffff01dcadc0d18a31d7, subtype=country}] |
| 08520ed77fffffff01152a636d6595da | locality | hamlet | Kingston | [{division_id=08520e87bfffffff01f0d8333eaebd81, subtype=country}] |

Others are basically exact matches of each other:

| id | subtype | local_type | name | capital_of_divisions |
| --- | --- | --- | --- | --- |
| 085b2cd0ffffffff01ec1a70e2268d2a | locality | city | Lefkoşa | [{division_id=085b39333fffffff01f145d7f9110d70, subtype=country}] |
| 085b2cd0ffffffff012a65eb36fb45c1 | locality | city | Levkosia | [{division_id=085b39333fffffff01f145d7f9110d70, subtype=country}] |
| 085b2cd0ffffffff01861b0866ec3f98 | locality | city | Levkosia | [{division_id=085b39333fffffff01f145d7f9110d70, subtype=country}] |
| 085b2cd0ffffffff01aa2be6f778df73 | locality | city | Levkosia | [{division_id=085b39333fffffff01f145d7f9110d70, subtype=country}] |

stepps00 commented 3 months ago

Thanks for the examples @skmoore.

Today, multiple place tags from OpenStreetMap (the local_type values you're seeing) are used to generate locality entities in the divisions theme. In the Bratislava example, the place is represented through multiple features in OSM with suburb and city place tags, so multiple entities are generated in Overture. This is not ideal and is causing the duplicate and overlap issues you're seeing, so some changes to the ingestion pipeline are being planned as a fix.

The Kingston examples are actually legitimate entities - one in Jamaica, one in Tasmania, and one in Norfolk Island, hence they have unique capital_of_divisions values.
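
As a rough way to separate the two cases, here is a minimal Python sketch that groups the rows quoted in this thread by subtype, local_type, and the set of capital_of_divisions IDs. This grouping key is an assumed heuristic for illustration, not how Overture matches entities, and `duplicate_candidates` is a hypothetical helper name. The Kingston rows land in separate groups, while the four Lefkoşa / Levkosia rows share one.

```python
from collections import defaultdict

# Rows copied from the tables in this thread:
# (id, subtype, local_type, name, division IDs listed in capital_of_divisions)
NICOSIA_CAPITAL_OF = ("085b39333fffffff01f145d7f9110d70",)
ROWS = [
    ("085cf0a87fffffff01899c602517d124", "locality", "city", "Kingston",
     ("085d2436ffffffff018dac84a99372bd", "085cf0acbfffffff0132b9b77708bb1e")),
    ("08516260bfffffff010a9b6d0cbde49e", "locality", "town", "Kingston",
     ("08516260bfffffff01d5cb63d73e2437", "085a391c7fffffff01dcadc0d18a31d7")),
    ("08520ed77fffffff01152a636d6595da", "locality", "hamlet", "Kingston",
     ("08520e87bfffffff01f0d8333eaebd81",)),
    ("085b2cd0ffffffff01ec1a70e2268d2a", "locality", "city", "Lefkoşa", NICOSIA_CAPITAL_OF),
    ("085b2cd0ffffffff012a65eb36fb45c1", "locality", "city", "Levkosia", NICOSIA_CAPITAL_OF),
    ("085b2cd0ffffffff01861b0866ec3f98", "locality", "city", "Levkosia", NICOSIA_CAPITAL_OF),
    ("085b2cd0ffffffff01aa2be6f778df73", "locality", "city", "Levkosia", NICOSIA_CAPITAL_OF),
]


def duplicate_candidates(rows):
    """Return groups of feature IDs that share subtype, local_type, and capital_of_divisions."""
    groups = defaultdict(list)
    for feature_id, subtype, local_type, _name, capital_of in rows:
        if not capital_of:
            # Without capital_of_divisions the key is too weak to say anything.
            continue
        key = (subtype, local_type, tuple(sorted(capital_of)))
        groups[key].append(feature_id)
    # Keep only the groups that contain more than one feature.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


candidates = duplicate_candidates(ROWS)
for key, ids in candidates.items():
    print(key[:2], ids)
```

The name is deliberately left out of the key, since the duplicates here carry different spellings. The heuristic only applies to localities that are capitals of some division, so it would miss cases like Bratislava, where local_type differs.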

The issue with the Levkosia example is slightly different: multiple entities were generated even though they all share the same place tags / local_type value. If you run this query in DuckDB

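-- Reading from S3 needs the httpfs extension and the bucket's region.
INSTALL httpfs;
LOAD httpfs;
SET s3_region = 'us-west-2';

-- Look up the source dataset and source record ID for the four duplicate entities.
SELECT
    id,
    sources[1].dataset AS dataset,  -- DuckDB lists are 1-indexed
    sources[1].record_id AS concordance_id
FROM
    read_parquet('s3://overturemaps-us-west-2/release/2024-07-22.0/theme=divisions/type=*/*', filename=true, hive_partitioning=1)
WHERE
    id IN ('085b2cd0ffffffff01ec1a70e2268d2a','085b2cd0ffffffff01aa2be6f778df73','085b2cd0ffffffff01861b0866ec3f98','085b2cd0ffffffff012a65eb36fb45c1');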

you'll see four unique OSM features, each with a different record ID:

┌──────────────────────────────────┬───────────────┬────────────────┐
│                id                │    dataset    │ concordance_id │
│             varchar              │    varchar    │    varchar     │
├──────────────────────────────────┼───────────────┼────────────────┤
│ 085b2cd0ffffffff01ec1a70e2268d2a │ OpenStreetMap │ R16283715      │
│ 085b2cd0ffffffff01861b0866ec3f98 │ OpenStreetMap │ R2628520       │
│ 085b2cd0ffffffff01aa2be6f778df73 │ OpenStreetMap │ N1893015330    │
│ 085b2cd0ffffffff012a65eb36fb45c1 │ OpenStreetMap │ R2628521       │
└──────────────────────────────────┴───────────────┴────────────────┘
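
To inspect those four source features on openstreetmap.org, a small Python sketch can turn each concordance ID into a URL. It assumes the leading letter encodes the OSM element type (N for node, W for way, R for relation), which is how the values above read but is not stated in this thread; `osm_url` is a hypothetical helper name.

```python
# Assumed mapping from the record ID prefix to the OSM element type.
OSM_ELEMENT_TYPES = {"N": "node", "W": "way", "R": "relation"}


def osm_url(record_id: str) -> str:
    """Build an openstreetmap.org URL from a record ID such as 'R16283715'."""
    prefix, osm_id = record_id[:1].upper(), record_id[1:]
    if prefix not in OSM_ELEMENT_TYPES or not osm_id.isdigit():
        raise ValueError(f"unrecognized record ID: {record_id!r}")
    return f"https://www.openstreetmap.org/{OSM_ELEMENT_TYPES[prefix]}/{osm_id}"


# Concordance IDs from the query output above.
for record_id in ["R16283715", "R2628520", "N1893015330", "R2628521"]:
    print(record_id, osm_url(record_id))
```

Three of the four IDs start with R and one with N, so under this assumption the duplicates come from three relations and one node.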

Ideally, a single entity would be maintained on Overture's end for this locality.

Both of these issues are related, and similar to a discussion around localities here. There is no firm timeline for a fix yet, but we'll share a progress update once some action is taken. We're hoping to make some pipeline updates soon, so this should be corrected in one of the upcoming releases.

Feel free to add additional examples; they're very helpful.

skmoore commented 3 months ago

@stepps00 Thanks for the info