johnvanbreda opened 9 years ago
Exploration of several similar queries raised the following areas of concern:
In this example, there is an occurrence attribute called #sex_stage_count which tells the query to join in all occurrence attributes flagged as "Count or abundance of a sex or stage", i.e. those that fall into the category of abundance data and can be output in a single column in the report. The query generates a long list of joins to support these pseudo attributes, because each survey dataset can have a different attribute definition for abundance. The values retrieved from the joins are then coalesced into a single abundance value. The same process occurs for biotopes, sex, stage etc.
Experiments with example queries showed that the long lists of joins could be reduced to a simpler set of joins, one per system function:
LEFT JOIN (occurrence_attribute_values oavSexStageCount
  JOIN occurrence_attributes oaSexStageCount ON oaSexStageCount.id=oavSexStageCount.occurrence_attribute_id
    AND oaSexStageCount.deleted=false AND oaSexStageCount.system_function='sex_stage_count'
  LEFT JOIN cache_termlists_terms tSexStageCount ON tSexStageCount.id=oavSexStageCount.int_value
    AND oaSexStageCount.data_type='L'
) ON oavSexStageCount.occurrence_id=o.id AND oavSexStageCount.deleted=false
Then, we can output the attribute value with a simple COALESCE, something like:
COALESCE(tSexStageCount.term, oavSexStageCount.text_value, oavSexStageCount.int_value::varchar, oavSexStageCount.float_value::varchar) as attr_sex_stage_count,
Although this change would dramatically reduce the number of lines and apparent complexity of the query, note that in my tests this change alone does not improve performance by more than a few percent. However, it is still a worthwhile change.
Records can lie across the boundary of two vice counties. Therefore, when we join to the locations table to get the vice county information, we have to use PostgreSQL's array aggregate functions and do a GROUP BY to merge the vice counties for a record into a single string. Simply removing this output column and allowing the query to run without the GROUP BY meant that a query I had previously aborted after 1 minute completed in approx 10 seconds, so this is a big performance improvement.
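For context, the aggregation described above looks roughly like the sketch below. The table and column names (the link through cached spatial-index data, the name column) are assumptions based on the description, not the actual warehouse schema; the point is that the string_agg plus GROUP BY forces grouping over every output column of the report.

```sql
-- Sketch only: merge all intersecting vice counties into one string per record.
-- The GROUP BY this requires is what makes the download query so slow.
SELECT o.id,
       string_agg(vc.name, '; ') AS vice_county
FROM cache_occurrences o
LEFT JOIN index_locations_samples ils ON ils.sample_id = o.sample_id  -- hypothetical link table
LEFT JOIN locations vc ON vc.id = ils.location_id                     -- vice county boundaries
GROUP BY o.id;
```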
Obviously we still want to output the vice county data. Options are:
If you run a download report for, say, records since the beginning of the month, this triggers a sequential scan on cache_occurrences because the date_end field is not indexed. Indexing the field makes a huge difference to these queries. Note, however, that the same problem would still occur for other date filters, e.g. verified on or updated on. The impact of this scan would be lessened if #12 was implemented, so I suggest we revisit this aspect afterwards rather than make changes now.
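If indexing date_end is pursued, the change itself is a one-liner (the index name here is illustrative, not an existing warehouse object):

```sql
-- Allow index scans instead of sequential scans for date-range filters
-- on download reports.
CREATE INDEX ix_cache_occurrences_date_end
  ON cache_occurrences (date_end);
```

Equivalent indexes would be needed on the verified-on and updated-on timestamp columns to cover the other date filters mentioned above.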
I've implemented the suggested change:
Note from David: I think the ultimate solution is to add a field to the cache_samples table and populate it with the VC with the largest area within the grid square. This is the BRC approach. Schemes don't find it helpful to have multiple VC values per occurrence as they are not set up to handle this. VC is the default admin unit so should always be included. It's an open question whether we need other admin units added to the cache – possibly the LRC boundary?
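A minimal sketch of David's suggestion, assuming PostGIS geometries and hypothetical column names (a geom column on cache_samples, boundary_geom on locations, and a location_type_id identifying vice counties), none of which are confirmed against the actual schema:

```sql
-- Store a single vice county per sample, chosen as the VC with the
-- largest overlapping area within the sample's grid square.
ALTER TABLE cache_samples ADD COLUMN vice_county_id integer;

UPDATE cache_samples cs
SET vice_county_id = (
  SELECT l.id
  FROM locations l
  WHERE l.location_type_id = 15  -- hypothetical id for the Vice County location type
    AND ST_Intersects(l.boundary_geom, cs.geom)
  ORDER BY ST_Area(ST_Intersection(l.boundary_geom, cs.geom)) DESC
  LIMIT 1
);
```

This trades the GROUP BY at report time for a one-off (plus incremental) population cost in the cache builder, which matches the BRC approach of a single default admin unit per record.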
This could be explored further - maybe having a cache of linked location strings by location type - including only those which are compiled into the index by the spatial index builder. It would link to the proposed cache_samples table.
Performance of queries used to download records from the warehouse can be very slow. An example had the following parameters, which resulted in the query apparently locking up: