Optimize certain queries?

rduivenvoorde commented 4 years ago

Thinking about a national/european measuring network of sensors, I found: https://airquality-frost.docker01.ilt-dmz.iosb.fraunhofer.de/v1.1/

But some (queryable?) interesting facts about this dataset fail because of gateway timeouts.

For example I was wondering about the current number of observations in that dataset: https://airquality-frost.docker01.ilt-dmz.iosb.fraunhofer.de/v1.1/Observations?$count=true (does not return the count but times out)

Another (to me very important query, because in QGIS I want to show in a (live) overview map of all most recent measurements) would be something like: "give me for every Thing the latest Observations (and from that the corresponding Location)), which I think could translate to: https://airquality-frost.docker01.ilt-dmz.iosb.fraunhofer.de/v1.1/Things?$count=true&$select=name,description&$expand=Locations($select=name,location),Datastreams($expand=Observations($orderby=phenomenonTime%20desc;$top=1)) but: same issue: timeout

So I was wondering: are these 'corner-case' use cases which I should not ask (though they seem to me important queries...) Or are they valid, but become too slow because of the size of the db/dataset?

In the latter case: would it be an idea to optimize this kind of queries maybe? Not sure how though... I understand that (probably) the database becomes the bottleneck?

hylkevds commented 4 years ago

Yes, some queries time out on large data sets. Especially any queries involving large count results, since counts are very slow on relational databases. The reason for this is that the db can't use an index to get this count, since indices are shared across transactions. So the database has to validate every single entry in the table to see if it is actually visible in the current transaction.

This service you mentioned has around 300 million Observations. Counting all of these simply takes ages.

We could try to fetch an estimate first, by using the output from explain analyze, and get the real count only if the estimate is small (<1000)...

The docker01 service is also an experimental one, on quite slow hardware. It'll be redirected to our new one soon: https://airquality-frost.k8s.ilt-dmz.iosb.fraunhofer.de/v1.1/T The new service does give results for your second query: https://airquality-frost.k8s.ilt-dmz.iosb.fraunhofer.de/v1.1/Things?$count=true&$select=name,description&$expand=Locations($select=name,location),Datastreams($expand=Observations($orderby=phenomenonTime%20desc;$top=1))

rduivenvoorde commented 4 years ago

Ah, thanks @hylkevds. Yes, you are right, with that much records I should not have asked for a count... And yes, the new service is pretty fast!!

I think for the average viewer(front end) of a dataset, one of the most asked queries will be: get me the latest observation values and locations of a certain phenomenon (to show on my map...).

So I'm going to try to find the query to: for every Thing measuring NO2 get me the latest Observation AND location, to be able to show it in QGIS (so retrieve it via GeoJson). Give me some time, unless somebody has it available, let me know :-)

One thing that strikes me when you retrieve the query as GeoJSON is that you end up with (in QGIS) with features with >200 attributes.... Crafting your queries becomes pretty important if you do not want to bloat your client with unnecessary data.... To me as an average Joe, it is not (yet) fully clear how to structure a query, for examplefor a usecase as depicted above:

should I search for Things and the scrape the latest Observation of a certain Datastream to them?
or should I search for the latest Observations and scrape the Things/Locations in it (unique...)?

Just for fun: all 4382 Things in the DB in QGIS (live retrieved based on a GeoJSON query:

Screenshot-20201015115001-1575x918

hylkevds commented 4 years ago

That QGis map looks good! And indeed, $select is very important to reduce the data load.

Where to start your query... that's a good point, I should add a section about that to our HowTo. In general, if you want your Observations somehow grouped (Latest per datastream, latest for a location) you can not start from /Observations. The location can come from either Thing, or FeatureOfInterest... In your specific example, getting only NO2 for each location, you could start from any of FoI, Datastream, Thing or Location:

FeatureOfInterest: v1.1/FeaturesOfInterest?$expand=Observations($filter=Datastream/ObservedProperty/name eq 'NO2')
Datastreams: v1.1/Datastreams?$filter=ObservedProperty/name eq 'NO2'&$expand=Observations($expand=FeatureOfInterest)
The Thing example you already gave
etc. I've left out the $select, $orderby and $top=1 of course.

If you want more than one ObservedProperty, you also can't start from FeaturesOfInterest, since you can only get one "latest" per FoI, and if you need two ObservedProperties, you would need two "latest" per FoI. But you can start from Datastreams, or any of the other Entities.

Where you start determined how many entities you get back. Starting from Datastreams gives you a different Entity for each ObservedProperty. Starting from Thing gives you all Datastreams for this thing in the same Thing Entity. Starting from Locations would give you all things from that Location in the same Location Entity.

FraunhoferIOSB / FROST-Server

Optimize certain queries? #287