johnvanbreda opened 9 years ago
Exploration of several similar queries raised the following areas of concern:
In this example, there is an occurrence attribute called #sex_stage_count which tells the query to join in all occurrence attributes flagged as "Count or abundance of a sex or stage", i.e. those that fall into the category of abundance data and can be output in a single column in the report. The query generates a long list of joins to support these pseudo attributes, because each survey dataset can have a different attribute definition for abundance. The values retrieved from the joins are then coalesced into a single abundance value. The same process occurs for biotopes, sex, stage etc.
Experiments with example queries showed that the long lists of joins could be reduced to a simpler set of joins, one per system function:
LEFT JOIN (occurrence_attribute_values oavSexStageCount
  JOIN occurrence_attributes oaSexStageCount ON oaSexStageCount.id=oavSexStageCount.occurrence_attribute_id
    AND oaSexStageCount.deleted=false AND oaSexStageCount.system_function='sex_stage_count'
  LEFT JOIN cache_termlists_terms tSexStageCount ON tSexStageCount.id=oavSexStageCount.int_value
    AND oaSexStageCount.data_type='L'
) ON oavSexStageCount.occurrence_id=o.id AND oavSexStageCount.deleted=false
Then, we can output the attribute value with a simple COALESCE, something like:
COALESCE(tSexStageCount.term, oavSexStageCount.text_value, oavSexStageCount.int_value::varchar, oavSexStageCount.float_value::varchar) as attr_sex_stage_count,
Although this change would dramatically reduce the number of lines and apparent complexity of the query, note that in my tests this change alone does not improve performance by more than a few percent. However, it is still a worthwhile change.
Records can lie across the boundary of two vice counties. Therefore, when we join to the locations table to get the vice county information, we have to use PostgreSQL's array aggregate functions and do a GROUP BY to merge the vice counties for a record into a single string. Simply removing this output column and allowing the query to run without the GROUP BY meant that a query I had previously aborted after 1 minute completed in approx 10 seconds, so this is a big performance improvement.
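For context, the aggregation described above looks roughly like the sketch below. The table and column names (the link through cached spatial-index data, the name column) are assumptions based on the description, not the actual warehouse schema; the point is that the string_agg plus GROUP BY forces grouping over every output column of the report.

```sql
-- Sketch only: merge all intersecting vice counties into one string per record.
-- The GROUP BY this requires is what makes the download query so slow.
SELECT o.id,
       string_agg(vc.name, '; ') AS vice_county
FROM cache_occurrences o
LEFT JOIN index_locations_samples ils ON ils.sample_id = o.sample_id  -- hypothetical link table
LEFT JOIN locations vc ON vc.id = ils.location_id                     -- vice county boundaries
GROUP BY o.id;
```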
Obviously we still want to output the vice county data. Options are:
If you run a download report for, say, records since the beginning of the month, this triggers a sequential scan on cache_occurrences because the date_end field is not indexed. Indexing the field makes a huge difference to these queries. Note, however, that the same problem would still occur for other date filters, e.g. verified on or updated on. The impact of this scan would be lessened if #12 was implemented, so I suggest we revisit this aspect afterwards rather than make changes now.
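If indexing date_end is pursued, the change itself is a one-liner (the index name here is illustrative, not an existing warehouse object):

```sql
-- Allow index scans instead of sequential scans for date-range filters
-- on download reports.
CREATE INDEX ix_cache_occurrences_date_end
  ON cache_occurrences (date_end);
```

Equivalent indexes would be needed on the verified-on and updated-on timestamp columns to cover the other date filters mentioned above.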
I've implemented the suggested change:
Note from David: I think the ultimate solution is to add a field to the cache_samples table and populate it with the VC with the largest area within the grid square. This is the BRC approach. Schemes don't find it helpful to have multiple VC values per occurrence as they are not set up to handle this. VC is the default admin unit so should always be included. It's an open question whether we need other admin units added to the cache – possibly the LRC boundary?
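A minimal sketch of David's suggestion, assuming PostGIS geometries and hypothetical column names (a geom column on cache_samples, boundary_geom on locations, and a location_type_id identifying vice counties), none of which are confirmed against the actual schema:

```sql
-- Store a single vice county per sample, chosen as the VC with the
-- largest overlapping area within the sample's grid square.
ALTER TABLE cache_samples ADD COLUMN vice_county_id integer;

UPDATE cache_samples cs
SET vice_county_id = (
  SELECT l.id
  FROM locations l
  WHERE l.location_type_id = 15  -- hypothetical id for the Vice County location type
    AND ST_Intersects(l.boundary_geom, cs.geom)
  ORDER BY ST_Area(ST_Intersection(l.boundary_geom, cs.geom)) DESC
  LIMIT 1
);
```

This trades the GROUP BY at report time for a one-off (plus incremental) population cost in the cache builder, which matches the BRC approach of a single default admin unit per record.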
This could be explored further - maybe having a cache of linked location strings by location type - including only those which are compiled into the index by the spatial index builder. It would link to the proposed cache_samples table.
Performance of queries used to download records from the warehouse can be very slow. An example had the following parameters, which resulted in the query apparently locking up: