ParkenDD / ParkAPI2

Rewrite of `offenesdresden/ParkAPI` with Django
MIT License

back to legacy lot ids & add more original scrapers #4

Open defgsus opened 2 years ago

defgsus commented 2 years ago

All lots need to keep the same ID that was generated in the original ParkAPI by the geojson wrapper (as discussed in issue #1).

In essence that means:

branch: feature/legacy-ids
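
For reference, this is roughly how those legacy IDs come about; a minimal sketch, assuming the geojson wrapper lowercases the concatenation of city and lot name, transliterates umlauts and strips non-alphanumeric characters (the exact rules in offenesdresden/ParkAPI may differ):

```python
def generate_id(city: str, lot_name: str) -> str:
    """Reconstruct a ParkAPI1-style lot ID (sketch only).

    Assumption: lowercase city + lot name, umlauts transliterated,
    everything non-alphanumeric stripped.
    """
    replacements = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
    raw = (city + lot_name).lower()
    for char, repl in replacements.items():
        raw = raw.replace(char, repl)
    return "".join(c for c in raw if c.isalnum())


print(generate_id("Dresden", "Altmarkt"))  # -> "dresdenaltmarkt"
```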

defgsus commented 2 years ago

Dear @jklmnn, adding the original scrapers is some work. It can take more than an hour per city. However, it's progressing. I'm testing everything properly, replacing http with https, and for the meta-infos I usually merge the scraped data with the original geojson files. E.g. in the Freiburg scraper (in get_lot_infos) the original ParkAPI geojson is downloaded from GitHub and combined with the geojson from the Freiburg server. (Once the new geojson file is written, the get_lot_infos method is not called anymore and the code becomes obsolete, not needed again until maybe a new lot appears within the pool. Though once the geojson file is edited by hand, this becomes more complicated..)
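
A rough sketch of that merge step, with illustrative URLs and field names (the actual Freiburg scraper may structure this differently):

```python
import json
import urllib.request

# illustrative source URLs; the real scraper's sources may differ
PARKAPI1_GEOJSON = (
    "https://raw.githubusercontent.com/offenesdresden/ParkAPI"
    "/master/park_api/cities/Freiburg.geojson"
)
CITY_GEOJSON = "https://example.com/freiburg-parking.geojson"  # placeholder

def get_lot_infos() -> dict:
    """Merge lot metadata from the original ParkAPI geojson with the
    geojson served by the city itself; the city data wins on conflicts."""
    def load(url: str) -> dict:
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    merged: dict = {}
    for source in (load(PARKAPI1_GEOJSON), load(CITY_GEOJSON)):
        for feature in source.get("features", []):
            name = feature["properties"].get("name")
            if name:
                # properties from the later (city) source overwrite
                # those from the earlier (ParkAPI1) source
                merged.setdefault(name, {}).update(feature["properties"])
    return merged
```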

I also update addresses if the website supplies more complete ones, like an added zip code, and I add public or source URLs for each lot where available.

I'm also a bit more strict about the nodata status, or rather the collected numbers: if no free-spaces or capacity number can be scraped, the values are set to None instead of zero.

For Dresden I scraped the geo coordinates from the website where available and used the ParkAPI geojson where no coords are listed. The website coordinates have more digits, so I thought this might be a good thing. But I guess it's possible that you and other contributors have picked more useful coordinates by hand, so this needs to be reviewed (not only for Dresden).

Anyway, I'm doing my best (to the best of my knowledge) to integrate the original scrapers and upgrade the meta-info where possible.

I also wrote to the Frankfurt open-data people about their outage (it stopped working on 2021/12/17).

Boy, I'm really looking forward to getting this project into production!

Best regards and a happy new fear

jklmnn commented 2 years ago

Great work!

> I'm also a bit more strict about the nodata status, or rather the collected numbers: if no free-spaces or capacity number can be scraped, the values are set to None instead of zero.

This is generally a good idea. However, I can't say for sure whether we can keep this once it goes into production. It might cause problems with legacy clients.

defgsus commented 2 years ago

> However, I can't say for sure whether we can keep this once it goes into production. It might cause problems with legacy clients.

Yes, replacing Nones with zeros in the v1 API should be no problem. In the dumps, snapshots with None can probably just be skipped.
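
A minimal sketch of such a v1 compatibility shim, with hypothetical field names:

```python
from typing import Optional

def to_v1_numbers(free: Optional[int], total: Optional[int]) -> dict:
    """Map ParkAPI2's None ("no data scraped") to the zeros that
    legacy v1 clients expect. Field names here are hypothetical."""
    return {
        "free": free if free is not None else 0,
        "total": total if total is not None else 0,
    }
```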

There are incompatibilities with some lot_ids, though. And other tricky stuff ;) I'll implement the remaining scrapers and then do a scripted comparison against api.parkendd.de.

Then we'll certainly have some things to discuss and compromises to find.

defgsus commented 2 years ago

The Frankfurt case: https://www.offenedaten.frankfurt.de/blog/aktualisierungverkehrsdaten

From the email: ... As soon as a corresponding security patch has been applied by the vendor, ..

hehe

defgsus commented 2 years ago

Okaaayyyhhh, here is the first attempt to compare api.parkendd.de against ParkAPI2/api/v1. I've gotten used to calling the former ParkAPI1 (or pa1) and the latter ParkAPI2.

https://github.com/defgsus/ParkAPI2/wiki/v1-api-comparison

I've only compared the 'city' metadata so far, not the lots; it's complex enough already. You can have a look if you like. I'm still preparing a more readable document with the specific compatibility issues.

One thing is sure: using names as IDs will remain problematic. They actually do change occasionally.

jklmnn commented 2 years ago

Sorry for the late reply. The problem with the lot IDs is that not all sources have real IDs, so we need to keep some kind of fallback. In the end, if there is no unique persistent ID and the data source decides to change the name, there isn't really anything we can do. We could use the location in some form, though. This is based on the assumption that a parking lot can't easily relocate itself, and if it does, we can safely assume it is a different one. This would also be useful if someone wants to use this data for analysis, since a different location might have implications for the traffic around the parking lot.
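
A hedged sketch of what such a location-derived fallback ID could look like; rounding the coordinates keeps the ID stable against small corrections, while an actual relocation produces a new one:

```python
import hashlib

def location_id(latitude: float, longitude: float, digits: int = 4) -> str:
    """Derive a fallback lot ID from coordinates (sketch only).

    Four decimal places are roughly 10 m, so minor re-measurements
    keep the same ID while a real relocation changes it.
    """
    key = f"{latitude:.{digits}f},{longitude:.{digits}f}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:12]


print(location_id(51.0504, 13.7373))  # stable 12-char hex ID
```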

defgsus commented 2 years ago

Yes, it's complicated with those IDs. I'm really only being picky because of later statistical use. Your location idea sounds quite good in this regard.

For daily use it's probably no problem if a lot name changes, apart from the fact that the lot is then no longer associated with its former .geojson entry, which, in ParkAPI2, would exclude it from the v1 API because it has no location and therefore no associated city.

With the right measures and follow-up maintenance this can be somewhat managed.

When porting your scrapers I found permanent IDs on some websites, but with the current data structure it's not possible to switch to those IDs for identification while keeping the original IDs (derived from the lot names) for compatibility.
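
One conceivable way to support both schemes, sketched as a hypothetical Django model (not ParkAPI2's actual data structure):

```python
from django.db import models

class ParkingLot(models.Model):
    """Hypothetical sketch: carry both ID schemes side by side."""
    # permanent ID as published by the source website, where one exists
    source_id = models.CharField(max_length=64, null=True, blank=True)
    # legacy name-derived ID, kept for v1 compatibility
    legacy_id = models.CharField(max_length=64, unique=True)
    name = models.CharField(max_length=128)
```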

I found so many little compatibility challenges during the port that it felt like real work. Well, at least I spent a couple of real working hours ;)

In the midst of it I started writing the following overview. There are things I wanted to add later but I simply forgot them.

General changes to scrapers

(no specific order, numbers are just for communication)

  1. Added lot_type "bus", which should be excluded in the API by default. It's just for statistics..
  2. All http URLs are changed to https. Even scraped links to individual lot pages are adjusted where needed.
  3. Removed Pool public URLs that just point to www.<city>.de. Where possible, they were replaced by something like www.<city>.de/parken/. General URL logic: if a Pool's public_url is scraped, source_url is left empty.
  4. Added a public_url to all lots that have an individual public webpage.
  5. City names are queried via a Nominatim reverse search using each lot's coordinates (see the first sketch after this list). The coordinates of a city in api.parkendd.de/ are the centers of the city polygon as returned by Nominatim. The original values from the ParkAPI1 geojson files are ignored because there is no explicit pool -> city mapping.
  6. A new lot property is live_capacity (see the second sketch after this list). It simply means: if there is a capacity number on the website, it is scraped with every snapshot. If not, the static capacity from the .geojson file is used, and live_capacity should be False to signal that the capacity number is static and might not reflect the true capacity at any point in time.
  7. Some cities have fewer lots now, judging by the API comparison. I need to find out for each lot what is going on there... The problem is that the missing lots might not be in the new .geojson files (if those have been re-rendered by scraping the page).
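
A sketch of the reverse lookup from point 5; production code should cache results and respect Nominatim's usage policy:

```python
import json
import urllib.request
from typing import Optional

def city_from_coordinates(latitude: float, longitude: float) -> Optional[str]:
    """Reverse-geocode a lot's coordinates to a city name via Nominatim."""
    url = (
        "https://nominatim.openstreetmap.org/reverse"
        f"?lat={latitude}&lon={longitude}&format=jsonv2"
    )
    request = urllib.request.Request(url, headers={"User-Agent": "ParkAPI2"})
    with urllib.request.urlopen(request) as response:
        address = json.load(response)["address"]
    # Nominatim reports "city", "town" or "village" depending on size
    return address.get("city") or address.get("town") or address.get("village")
```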
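
And the capacity logic from point 6, sketched with hypothetical names:

```python
from typing import Optional, Tuple

def resolve_capacity(
    scraped: Optional[int], static_capacity: Optional[int]
) -> Tuple[Optional[int], bool]:
    """Return (capacity, live_capacity): a capacity scraped per snapshot
    counts as live; otherwise fall back to the static .geojson value
    and flag it as possibly outdated."""
    if scraped is not None:
        return scraped, True
    return static_capacity, False
```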

Individual scraper changes

That's it for now.

Please let me know what you think, and let's progress, slowly..

jklmnn commented 2 years ago

I just checked the available cities after our current outage, and the only city I can see missing is Hanau. So after we add this I'd say we can close this issue.