db layout and schema discussions

defgsus commented 3 years ago

Hello @jklmnn and @kiliankoe,

i'd like to continue the discussion about the layout of the database here. https://github.com/offenesdresden/ParkAPI/issues/224

As it's not completely intuitive to read the relations from the code i will write down the basics. The CamelCase names are the django models which translate to database tables.

ParkingData
- timestamp (datetime)
- lot_timestamp (datetime)
- lot (ParkingLot)
- status (str) # open|closed|unknown|nodata|error
- num_free (int)
- num_total (int)
- ...
ParkingLot
- osm_id (str) # osm ID if available
- geo_point (gis)
- geo_polygon (gis)
- lot_id (str) # unique permanent identifier of parking lot
- pool (Pool)
- city (City)
- lot_type (str) # street|garage|underground|bus
- max_num_total (int)
Pool
- pool_id (str) # unique permanent identifier of parking pool
- name (str)
- source (str)
- provider (str)
- license (str)
- ...

ParkingLot and all below are based on OSMBase model and support automatic retrieval of osm data

City
- osm_id (str) # unique permanent identifier
- geo_point (gis)
- geo_polygon (gis)
- name (str)
- ...
- state (State)
- country (Country)
State
- osm_id (str)
- geo_point (gis)
- geo_polygon (gis)
- name (str)
- ...
- country (Country)
Country
- osm_id (str)
- geo_point (gis)
- geo_polygon (gis)
- name (str)
- ...

A few notes:

Pools are not in the code yet
Further changes to existing tables are quite easy to implement in Django. E.g. adding an optional field simply requires to add it to the Model code and call ./manage.py makemigrations to publish it.
The postgis requirement for the database adds minimally to the development and CI effort but it will allow to efficiently query by location later on.
The City, State and Country tables are just a proposal. They have the only purpose of allowing granulated queries later on.
Identifying locations by osm_id instead of names is certainly more duplicate-proof. And it allows to query geo and other information later on. No need for the scraper to provide more than the osm_id. It does not exist for many parking lots, though.
A ParkingLot always requires a City right now. That's the minimum requirement which should not be to hard. Though, a City currently does not require State or Country. All of this data can be tied together later on in the django admin. Also note that City links to State and Country. That means one can actually create nonsense data in the database.

Another way of doing it is to require the ParkingLot -> City -> State -> Country mapping and to require the scraper to provide all the osm_ids for it. Still it would be possible to end up with inconsistent data because someone might have a typo in the IDs or bad humor.

There is a nominatim endpoint (details/) which displays the address hierarchy of any object. That would allow to get State and Country just from the city osm_id. Unfortunately it's not allowed to read from it in an automated fashion. I did not find another way of getting the osm hierarchy yet. EDIT: Guess it's quite reliably possible by using the features.properties.address property of the nominatim geojson when addressdetails=1.

Another option: scraper needs to supply city osm_id and nothing more, the City does not immediately require a State and basically an admin has to tie all missing links together at some point.
And finally: The normalized database layout is something different than the type of data that a scraper is required to provide. I think we should be nice to contributers and make it as easy as possible. The current layout can be found in park_data/models/_store.py and the unittest

defgsus commented 3 years ago

I tried creating the OSM location hierarchy by searching nominatim for all relevant parts of an address. It's on the feature branch feature/osm-locations.

There is a new django app osm_locations and all locations share the same database table. I added two parking lots (in Germany and Belgium) by only supplying the osm_id and ran the command ./manage.py pa_update_osm_locations which grabs all relevant data from nominatim (code).

Here's a screenshot of the django admin of the result:

It kind of works, but:

The type of place in the table is the geojson.feature.properties.extratags.linked_place property which is not always present, so we don't reliably know if something's a city or a district or state.. Though, it's probably encoded in the place rank from geojson.features.properties.place_rank. Looking at that and comparing the place rank with Thüringen then Liège is probably not the city of Lüttich but the Province of Lüttich which means the city was somehow skipped in the automatic search (and this is an ugly repair task for an admin)
Which is probably because the search is queried by the string value of the address fields..
Another problem is that there is no efficient database query to get, e.g. the country of a parking lot. You'll need to follow the parent relation until there is no parent anymore..

In spite of these facts i propose to follow the original design with individual tables for cities, states and countries (maybe also districts and suburbs) which are all interlinked, e.g.:

ParkingLot
- city (City)
- state (State)
- country (Country)
City
- state (State)
- country (Country)

a.s.o...

It makes forward and backward queries easy. The data can still be queried automatically and missing cities like in the above case can be detected easily and added without untying an existing parent relation.

EDIT: About the place_rank parameter:

An osmnames document says: The place rank indicates how important a element is (lower means more important). A continent for example has a place_rank of 2, which is the lowest place_rank possible. The place_rank is either the double of the admin_level, if the admin_level is set, or a value depending on the type of the element. The mapping can be found here.

The file disappeared in August 2018 and today maybe this describes it more precisely? Well, it's complicated. And what an amazing project, btw.

jklmnn commented 3 years ago

Just some short ideas, I didn't have the time to take a deeper look yet. Tbh I'd like to drop City, State and Country completely. Data source and data publishing should be independent. So in theory all we need is pools. For the current API (which unfortunately needs to stay for a while since people actually started being compatible to it) we need to somehow get a City and a Region. The Region is of no use other than that it is displayed in some of the apps. I have yet so see any practical use and it's not always possible to get it from all sources. Unfortunately we can't easily update all apps so we have to keep it in the API and at least fill on some dummy value. The city is our current grouping for the apps. I'd like to drop it from the database and rather define it e.g. by a location with a radius. This would allow us to keep the current API and still provide data from sources (Pools) that do not adhere to the strict city structure (e.g. the DB provides parking data for their garages, but they provide a single source for all garages in Germany, we currently can't really use that).

defgsus commented 3 years ago

@jklmnn With Pools that do not adhere to the strict city structure you mean something like www.apag.de? In that case i'd think apag is the pool id and each lot can have it's own city.

I agree with your thought that exact location is the most important thing. That's why i settled on postgis. I havn't tried it yet but it's supposed to make those location queries a delight.

Anyways, that whole administrative hierarchy thing is just an idea. Practical use is not so great if i think about it twice, except to show off in the statistics page like look we have so many countries, states, cities.

But i still like the idea of getting an osm_id from the scraper and everything else can be determined from that. But i think the city-level is the lowest one we can expect to exist at every possible place.

I actually like your design of supplying a geojson file with every scraper but i've seen that you have put them together yourselves in cases. That's not something i would require contributers to do, A lat/lon for each lot would be enough..

jklmnn commented 3 years ago

With Pools that do not adhere to the strict city structure you mean something like www.apag.de? In that case i'd think apag is the pool id and each lot can have it's own city.

Yes, I also mean https://data.deutschebahn.com/dataset/api-parkplatz.html.

The reason for the geojson file is mainly that many data sources do not provide sufficient information (such as the location, total lot count, etc). We need a way to match the static data we have to the data provided by the source. I can't think of a way to generally automate that. What do you do if you have a website with just names and a free count?

Tbh I haven't looked into osm_id yet so I can't say anything about this. About the structure all a lot needs is a location, everything else such as city or state can be derived from that if required. The only exception from that may be the address. The geolocation may be the center of the lot sometimes which makes it difficult to infer an address. The idea is that the address we provide is either the one from a (more or less) official source or we try to choose an address that matches with the actual physical entrance of the lot. We don't want people to end up at the back side of a parking garage if they use our data to navigate.

defgsus commented 3 years ago

The reason for the geojson file is mainly that many data sources do not provide sufficient information (such as the location, total lot count, etc). We need a way to match the static data we have to the data provided by the source. I can't think of a way to generally automate that. What do you do if you have a website with just names and a free count?

I understand. It's just a tiny discussion about the format then. But i think supplying a well-formatted example geojson and some docs should enable contributers to supply them.

One thing bugs me: The case that a scraper might find a new lot on a website, e.g. on apag, db-parking or here or basically anywhere.

a) If the configuration is fixed in a geojson file, it can not publish the lot until someone updates the file.

b) If the configuration is auto-generated from available (website/api-)data, the scraper might publish something erroneous.

I prefer a over b, though it would be good to record that event in the database or send a mail.

Agree with your point about a good address. Think it must be supplied by hand. Just a note: The OSM project does not have yet permanent IDs for things. The osm_id is permanent for an object but the object might be split into several by editors later on. That won't happen for cities and larger entities, i guess, but it could be problematic for streets, and parking lots.

So the required lot meta-info is!?

pool (id/link to pool with license, etc...)
id (unique permanent lot id)
lat/lon (potentially of the entrance)
lot_type (garage/lot/...)
address (free text? or normalized country-specifc?)
city (for backwards compatibility)
region (for backwards compatibility)

EDIT: The nominatim reverse search will probably get us the city and region name for every lat/lon without problems. The max zoom choice-box on the website also explains the place_rank variable mentioned above.

defgsus commented 3 years ago

Hey guys, i'm considering this situation a bit like: i'm the overenthusiastic engineer and you are the valuable critiques that pull over most of the things i developed in frenzy. Which is not bad, you know.

So, i've removed all the OSM and location-hierarchy stuff for now. The current master branch actually holds a working prototype. Please check the README for details.

There is still a lot of stuff to discuss but maybe we should just call at some point.

jklmnn commented 2 years ago

Thanks! I'll take a look at it. Sorry for the late replies. I still have to take lots of vacation days in December so I might be a bit more active here then ;) I just took a quick look over the Readme and the scraper for Dresden, impressive work! One thing I noticed is that you did some changes to the public API. We still have to support the old API but given the similarities between both I think that should be an easy task. I hope to be able to take a more detailed look in the next few days.

defgsus commented 2 years ago

Dear JK! No worries. As long as i get a few signs of activity i'm okay, Actually i would have been involved in a new project this week but it was postponed so i had the time..

The old API will be re-implemented. Currently i'm just having some fun with django-rest-framework. Although fun is not exactly the right word..

The required parameters that a scraper must supply are in the structs.py file. It's really not much, not even a city ;) The license will be accompanied by attribution source and url as in your code. Can you tell me, what means active_support and aux (found in Dresden.geojson)

jklmnn commented 2 years ago

active_support is a flag that can influence the default behavior of apps. It's basically saying that we're actively trying to fix that city as soon as possible if it is broken. If it is set to False the city may or may not be broken (and or fixed) so apps can decide not to show it by default. aux for Dresden is a mapping identifier that we were using for an actual API we had access to. Unfortunately this API was part of a research project about Open Data so it got turned off once the project was deemed "successful" (say people were actually building things upon it). That's the understanding of Open Data we sometimes have to deal with...

defgsus commented 2 years ago

Just about this nominatim hierarchy. It is really complex,.. and someone in Dresden is working on the implementation side ;)

jklmnn commented 2 years ago

A short comment to active_support and aux. I think we can drop aux entirely from the data. If we ever get an API again I think it might be a bit different. In the worst case we either have to do some manual work again or dig out a commit where it is still in the data. However I'm skeptical that this will happen. About actice_support. I'm not sure if we want to keep it. So far when a city had it set and we weren't able to fix it we just turned it off. If we could easily fix something we usually did this anyway. We have to keep it for the old API but there it could be set to True as a default.

defgsus commented 2 years ago

In general, the PoolInfo and LotInfo (and the resulting geojson) data is only an initial fixture for the database. They might update values that are null but they won't overwrite existing values in the DB, because ... It's actually quite comfortable to edit stuff in the admin interface. E.g. it would be easier to just turn active_support on or off in the admin (just imagine the flag would be actually there in the screenshot):

The problem, of course, is that the database state diverges from the initial fixtures. Don't have a good solution for that, yet, except to not edit in the admin in general and always update from the fixtures. It would also be possible to render the PoolInfo and geojson files from the database state to back-port it to the scrapers.. But it sounds like a messy data retention.

About active_support in special: Right now, the database is storing scraping errors in the database as well. E.g. the typical soup.find("elem").text will raise an exception if elem is not in the html anymore. This is stored to the database with the corresponding pool_id. It would not be hard to implement a daily routine that collects errors and lot data where num_free=None and sends a warning email. The active_support flag can be a database-only flag and does not need to be part of the scraper fixtures, i think.

It can also be determined automatically. or, we could publish an up-time percentage for each pool and lot.

jklmnn commented 2 years ago

To be honest, I'm not to thrilled about the admin interface. Having to support a web interface on our production server is something I'd like to avoid as it comes with additional security requirements and administration overhead.

Sending a message in case of an error is a good idea. We currently have a bot running that sends a message to the matrix channel if a cities data is older than a day (you don't want to send a message immediately, sometimes the data providers are doing just maintenance, etc, and secondly, our response time isn't fast enough anyway for anything that fixes itself within a day).

Publishing an uptime percentage (or data quality measurement) for pools and lots is a good idea. Uptime doesn't make sense for lots, maybe rather some statistics on how often they're updated. Sometimes a data source is "active" and provides data, but the timestamp provided by that source is old. Separately the reliability of a data source could be expressed by a metric that shows if the source was reachable and if we were able to read valid data.

defgsus commented 2 years ago

Okay, we can drop the admin. I agree simply because of the duplicate data problem. I don't think it creates more overhead on the maintenance side (apart of trusting the users that have a login). The django server needs to run anyways to provide the API endpoints and the admin is an integrated part of django. We never had security issues in our customer projects but then again, django isn't a huge attack target. Not like PHP projects ;-)

jklmnn commented 2 years ago

Having a long site that requires a separate subdomain and certificates (or a vpn) is already some extra I don't really want to have. Especially given the fact that we're probably not using it 99% of the time. The most secure interface is one you don't have. We already have SSH so that should be sufficient ;) The API endpoints will be routed to the reverse proxy we already have.

defgsus commented 2 years ago

See it the same way. No interface == most secure. It's really not necessary and actually determines the source of truth for the data: it's the scrapers (or their geojson files).

However. Now i'm also working enough that i regularly leave the laptop in the office when i'm done. But it won't always be like that. I think we can do the transition, it's just no speedy sprint ;)

Currently the biggest differences between the current ParkAPI and the rewrite are the lot_ids and the missing region. I'm generating the lot IDs a bit differently, e.g: dresdenlangebrueck is now dresden-langebruck (but i kept the original string-stripping function in the utils folder)

I guess it's not completely knowable what people and apps do with those IDs. I do like the - separators, but then again, i understand that it is some trouble to keep compatibility (also with the archive filenames).

The region string is only supported for Dresden? If so, we may just hard-code it into the api v1 responses. Or we could actually allow an extra set of key/value properties (from the geojson or the scraper) and deliver that in the api?

Just some thoughts.. Have a good time!

jklmnn commented 2 years ago

The lot ids are just used internally. At least none of the applications I know uses it (there is a single hard coded exception in our website, but that would be an easy fix). Initially we wanted to use some kind ID that is provided by the data source. But this led to a few (I think just 2) cities having different lot_ids while the rest just uses the names stripped and concatenated. I think we should only provide our own ID format publicly to keep it consistent. We should store the lot ID of the data source as an optional property (we may need it to match meta data).

Yes, the region is only relevant in Dresden. Mainly because Dresden was the first data source (hence the name ParkenDD) and their website had a region structure at the time. We have to keep the JSON key though, just to make apps not crash. It would even be valid to just set it to the city name. I don't think that there is any relevant information in the region that anyone cares about. We could also use the ZIP code of the lot.

defgsus commented 2 years ago

This message does not really fit into the issue's topic but i like the diary character of this thread.

I've started to implement the v1 API. Apart from the lot_ids and regions, as discussed earlier, this implementation does pretty much the same. What is really missing is the timespan v1.0 and the forecast. timespan v1.1 is working for historic values. All relevant code is in web/api_v1/. I tried to tie everything to the rest-framework as much as possible so that the API documentation can be generated automatically but the generated schemas are not complete yet.

The original api root path which lists the city-dictionary is a bit tricky and we have to test it once all original scrapers are migrated. Code and description here.

To make it eventually replace the original API server behind your proxy we probably need to move the url paths. Currently they are behind /api/ but it's no problem to move them to /. If we leave out the admin interface then there is only one reason remaining to put nginx (or something similar) in front. The /api/docs path delivers the API documentation and live-executable-thingy like in this screenshot:

It requires some static javascript and css which must be delivered by a web server. But we can also drop that interface and just point people to the auto-generated /api/swagger.json and they can load this into postman or similar tools.

Thanks for listening, my dear diary. Now back to work ;-)

kiliankoe commented 2 years ago

An explorable swagger API interface like that actually sounds pretty great!

defgsus commented 2 years ago

Yes, it's kind of fun. You still have to type in iso-formatted date strings, though ;-)

Configuring the rest-framework interfaces is not so funny. E.g. declaring the return status code for the /coffee endpoint probably requires overloading the default schema class and patching it somehow. Well, it's some work but it's worth the trouble... i hope.

kiliankoe commented 2 years ago

Yeah, don't worry about that one 😅

defgsus commented 2 years ago

Hehe, i'm not deeply concerned with that coffee endpoint. Just at times i thought this rest-framework would be more intuitive...

jklmnn commented 2 years ago

Don't worry about the forecast. If we set the forecast for all cities to false we can drop the forecast API for the time being. We don't have any current forecasts and unless we either find someone with knowledge and motivation to do it for us or we gain the knowledge ourselves, we won't have any forecasts in the near future anyway. I think we could also drop timespan v1.0. I'm not aware of anyone using it and afaik it operates on forecast data which we don't have. We should support timespan v2.0 with historical data though.

Having swagger hosted on our own site shouldn't be a problem. And having it is certainly a big improvement in terms of API documentation (we should really do this for the forecast API, I always forget how to use it and I implemented it).

ParkenDD / ParkAPI2

db layout and schema discussions #1