Helium314 / SCEE

OpenStreetMap surveyor app for experienced OSM contributors
GNU General Public License v3.0

WIP: Custom quest to import missing locations from ATP in the Czech Republic #523

Open vfosnar opened 3 months ago

vfosnar commented 3 months ago

Right now I'm finishing the backend, so this can be merged once I'm sure things are stable. The included image is public domain (https://www.svgrepo.com/svg/491993/electricity-bill).

One thing I'm a bit stuck on, though it isn't a showstopper: `fun onSyncedEdit(edit: ElementEdit, id: String?)` never gets called. So if the user doesn't have an internet connection at the time of the edit, it doesn't get synced to my backend, which otherwise only updates this data once per day.

Helium314 commented 3 months ago

@matkoniecz you are working with ATP stuff, so I thought you might be interested in this quest?

matkoniecz commented 3 months ago

Yeah, I am implementing exactly this right now :/

matkoniecz commented 3 months ago

> Right now I'm finishing the backend

Is it open source, or will it be?

matkoniecz commented 3 months ago

> missing locations from ATP

How does it detect features that are present in ATP but missing from OSM? Is it skipping low-quality spiders?

> in the Czech Republic

What is the reason for such a limitation? Server costs?

matkoniecz commented 3 months ago

How can the user change the location compared to what ATP reports? Note that in basically all cases the location reported by ATP is not good enough for OSM purposes; mismatches range from a few meters up, with offsets of 20 m or 40 m being normal.

(There are also objects offset by much more, but at that point it also shades into "ATP claims it exists, but it does not exist" cases, where something is offset by 2 km, 200 km, or 2000 km.)

vfosnar commented 3 months ago

Oh cool!

First things first: the primary target for this project was to update already-existing elements, but I realized at least half of the entries in the Czech Republic are missing.

At this point it's a bunch of Python scripts bodged together, but I'm slowly cleaning it up.

> Is it open source, or will it be?

Yes, it is, at https://gitlab.com/vfosnar/atpsync and https://gitlab.com/vfosnar/atpsync_backend

> How does it detect features that are present in ATP but missing from OSM?

For finding already-matched elements, it checks whether either brand:wikidata and ref both match, or the tag and value of ref:atp:<spider name> match.

For finding previously unmatched ones, it searches within a radius of 100 meters. If it finds a match, for example within 20 meters, it checks double that distance, in this case 40 m, to rule out possible duplicates/collisions. These are not common, but they happen and need to be resolved manually (for example, when both a node and a way have the same tags).
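A minimal sketch of those two steps in Python, with illustrative names (this is not atpsync's actual code):

```python
# Illustrative sketch of the matching rules described above; all field and
# function names are assumptions, not taken from atpsync itself.

def is_already_matched(osm_tags: dict, atp: dict, spider: str) -> bool:
    """An OSM element counts as matched if brand:wikidata and ref both
    agree, or if it carries the spider-specific ATP ref."""
    return (
        osm_tags.get("brand:wikidata") == atp["brand_wikidata"]
        and osm_tags.get("ref") == atp["ref"]
    ) or osm_tags.get(f"ref:atp:{spider}") == atp["ref"]

def find_unmatched_candidate(atp_point, candidates, dist):
    """Search within 100 m; if the closest hit is at distance d, look again
    within 2*d: a second hit there (e.g. a node and a way with the same
    tags) is a possible duplicate that has to be resolved manually."""
    hits = [c for c in candidates if dist(atp_point, c) <= 100]
    if not hits:
        return None, False
    best = min(hits, key=lambda c: dist(atp_point, c))
    d = dist(atp_point, best)
    rivals = [c for c in hits if c is not best and dist(atp_point, c) <= 2 * d]
    return best, bool(rivals)  # (match, needs_manual_review)
```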

> Is it skipping low-quality spiders?

I hand-picked some spiders, as there are collisions between them and a lot of the data simply can't be included in quests, e.g. Tesco only has city-level precision. It only makes sense to monitor such data.
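For illustration, the hand-picked list could be as simple as a per-spider config like this (the spider names are made up):

```python
# Illustrative only: hand-picked spiders with per-spider handling, matching
# the quest vs. monitor-only split described above. Spider names are made up.
SPIDERS = {
    "kfc_cz": "quest",      # precise coordinates -> can be offered as quests
    "tesco_cz": "monitor",  # city-level precision -> only monitored
}
```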

> What is the reason for such a limitation? Server costs?

I wanted to start small: I know the Czech Republic better than the rest of the world, and I'm more aware of the local conventions. There is no technical reason.

> How can the user change the location compared to what ATP reports?

When creating/editing an element in SCEE, they can move the node wherever they want; after that, the server will match it based on its ref:atp:<spider name>, regardless of where it's located.

I'm open to collaboration, but...

What is your take on this, @matkoniecz?

vfosnar commented 3 months ago

BTW, I have an (outdated) map (based on an outdated source, i.e. ref:atp:<spider name> -> ref) if you want to see where I'm currently at: https://atpsync.vfosnar.cz/

matkoniecz commented 3 months ago

Targeting the whole world from the start doesn't feel right.

My plan was to target my own country at the start, to allow testing the quality of what is being suggested.

But with a design that would allow processing the worldwide dataset in the future.

> There is a lot of invalid data. I had to modify basically every scraper I'm using to actually scrape valid information.

My plan for that was to import only the shop name, brand, and type, and to ignore all other tags, as I was worried about this.

For example, I would consider https://github.com/alltheplaces/alltheplaces/issues/6943 to be a prerequisite for using any opening_hours tags from ATP.

Though even top-level tags often need adjustments (I opened https://github.com/alltheplaces/alltheplaces/pull/7344, https://github.com/alltheplaces/alltheplaces/pull/7572, and https://github.com/alltheplaces/alltheplaces/pull/6763 so far, and reported more cases; see e.g. https://github.com/alltheplaces/alltheplaces/issues/7600).
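A minimal sketch of that conservative import, assuming an illustrative key list:

```python
# Sketch of the conservative plan above: keep only the name, brand, and
# feature type, and drop everything else (opening_hours etc.) until the
# upstream data quality allows more. The exact key list is an assumption.
TOP_LEVEL_KEYS = {"name", "brand", "brand:wikidata", "shop", "amenity"}

def conservative_tags(atp_tags: dict) -> dict:
    """Filter an ATP feature's tags down to the safe top-level subset."""
    return {k: v for k, v in atp_tags.items() if k in TOP_LEVEL_KEYS}
```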

matkoniecz commented 3 months ago

I was just starting to think about how to design things so that performance/costs would still allow worldwide processing (now and in the future) without strain. So far I was just reminding myself about some data structures (that I used a long time ago).

Are you doing anything smart here for matching OSM data and ATP data? Or maybe I am overthinking it, and brute force scales up well at least to the country level? Though maybe it will stop working once more than a few spiders and more than one country are enabled.

After all, brute force worked well for my matcher of a single spider across Europe, but running something like that 2000 times does not sound like a good idea to me. Then again, maybe comparing each spider against its Overpass query output is not such a bad idea after all?
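For what it's worth, one shape "something smart" could take is a spatial index built once over the OSM candidates, e.g. with Shapely 2.x's STR-tree (a sketch only, naively working in degrees):

```python
# Hedged sketch of one alternative to a brute-force O(n*m) scan: index all
# OSM candidate points once in an STR-tree, then query each ATP point
# against it. Distances are naively in degrees here; real code would
# project to meters first.
from shapely.geometry import Point
from shapely.strtree import STRtree

def match_all(osm_points: list[Point], atp_points: list[Point], max_dist: float):
    """For each ATP point, list the indices of OSM points within max_dist."""
    tree = STRtree(osm_points)  # built once: O(n log n)
    result = []
    for p in atp_points:
        # bounding-box lookup through the index, then an exact distance filter
        idxs = tree.query(p.buffer(max_dist))
        result.append([int(i) for i in idxs
                       if osm_points[int(i)].distance(p) <= max_dist])
    return result
```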

In general, I would really prefer to cooperate on an existing project rather than start a separate brand-new one! I will look around the code to see what you did, and maybe I will send some PRs. BTW, I would consider having the readme (also?) in English.

And at the least I can try playing with it, to judge how well this works in action as a StreetComplete quest.

vfosnar commented 3 months ago

> Are you doing anything smart here for matching OSM data and ATP data?

Not much... Python is, unsurprisingly, the largest bottleneck right now. I'm doing an Overpass query for each spider; for example, for KFC it will search for fast food places with "KFC" or "kentucky fried chicken" in the name. I also considered searching within the brand tag, but I was not able to write such an Overpass query.
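For reference, such a per-spider query might look roughly like this (the tag filter is my guess, not the actual query):

```python
# Rough sketch of such a per-spider Overpass lookup; the tag filter is an
# assumption, not atpsync's actual query. 3600051684 is the Overpass area
# for the Czech Republic (3600000000 + relation id 51684).
import requests

OVERPASS_URL = "https://overpass-api.de/api/interpreter"

KFC_QUERY = """
[out:json][timeout:60];
area(3600051684)->.cz;
nwr["amenity"="fast_food"]["name"~"kfc|kentucky fried chicken",i](area.cz);
out tags center;
"""

def fetch_kfc_candidates() -> list[dict]:
    """Download fast-food elements that look like KFC branches in Czechia."""
    resp = requests.post(OVERPASS_URL, data={"data": KFC_QUERY}, timeout=90)
    resp.raise_for_status()
    return resp.json()["elements"]
```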

> maybe I will send some PRs.

Treat this more as a PoC project. If I wanted to scale, I'd rather pick a better-suited language like Rust or Go. The code is already slow af, and a rewrite is worth considering.

> BTW, I would consider having the readme (also?) in English.

Right now it's just a useless summary of a specific part of the code anyway :)

Can you write me on Matrix so we don't spam here? @me:vfosnar.cz

vfosnar commented 3 months ago

Or maybe it's better to keep using a simpler language on the backend and use PostGIS to do the OSM lookups locally. That allows optimizing for the specific queries and doesn't overload the Overpass servers.
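A sketch of what such a local lookup could look like, assuming an osm2pgsql import (its default planet_osm_point table, with the way geometry in SRID 3857):

```python
# Sketch of a local PostGIS lookup replacing the Overpass query, assuming
# OSM data imported with osm2pgsql (default planet_osm_point table, `way`
# geometry in SRID 3857). Table and column names are assumptions.
import psycopg2

def fast_food_near(conn, lon: float, lat: float, radius_m: float = 100.0):
    """Fast-food POIs within radius_m meters of an ATP location.
    A real setup would store geography and index it, rather than
    transform per row as done here."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT osm_id, name
            FROM planet_osm_point
            WHERE amenity = 'fast_food'
              AND ST_DWithin(
                    ST_Transform(way, 4326)::geography,
                    ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
                    %s)
            """,
            (lon, lat, radius_m),
        )
        return cur.fetchall()
```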

matkoniecz commented 3 months ago

> Can you write me on Matrix so we don't spam here? @me:vfosnar.cz

I posted there