New spatial index columns in cache_* tables not always filled in

johnvanbreda commented 8 years ago

Looking through the records coming in on warehouse1, it appears that approx 10% of the records don't get their location_id_vice_county or similar fields filled in in cache_occurrences_functional despite having an entry in index_locations_samples that implies they fall entirely inside a vice county. Martin Harvey also reported a situation where some records did not appear to be indexed for a few hours despite the logs indicating the spatial indexer was running normally at that time.

One of the issues might relate to the fact that if a sample is added first, then later an occurrence added, this does not trigger the indexing of that occurrence specifically. This was OK before because only samples were indexed, now that occurrences are indexed as well they need to be picked up separately. However there are also cases where the cache_samples_functional table did not pick up the link to the vice county location (e.g. 3295358 and subsequent records) which need further investigation.

johnvanbreda commented 8 years ago

Checking the code, I realised that any exceptions raised by the query which fills in the locationid* fields would in fact simply get emitted to the screen and not logged. I’ve fixed this so that the exception is logged (see https://github.com/Indicia-Team/warehouse/commit/4cf0de960b815a9dd87dbe267ddc7d98a84b5d01) and left it running a few days to catch some of the problems.

I’ve now got some information and I can see that the problem is some sort of deadlocking issue. Here’s an example – I’ll try and explain the timeline all of which occurred yesterday (3rd July). 11:47:51 – a UKBMS transect section sample is saved. This means the user had filled in the details of a transect and proceeded to the data entry grid. NOTE – this grid allows you to input values for each count per species per section of the transect and saves the occurrences as you go along. The user seemed to wait a few minutes, then proceeded to enter 6 records spread fairly evenly between 11:54:12 and 11:55:58. 11:55:11 – the spatial_index_builder runs a query which captures all the samples that have been updated since the last run and puts the Ids into a temporary table. It would therefore have captured some of the records in the sample but some would have come in after this set of records to process was captured. 11:56:06 – a deadlock error is logged. The spatial index builder is trying to update the samples in the temporary table to fill in the locationid* fields and the record being saved by the UKBMS user has triggered an update of the samples cache table to fill in the map squares. Both are hitting the same row at the same time and therefore it deadlocks. I can find reference to unordered updates causing deadlocks but I’m surprised that’s the case here since one of the queries is updating a single record so the order should not matter.

Although I think I can improve the logic to make unnecessary updates of the cache tables less likely which might well have fixed this instance, there is an underlying problem which I’m not sure I know the resolution to yet. Any thoughts appreciated.

johnvanbreda commented 8 years ago

Script which can be run to fix existing records whose locationid* entries were not filled in:

select s.id into temporary smplist from cache_samples_functional s join index_locations_samples ils on ils.sample_id=s.id and ils.location_type_id=2188 and ils.contains=true where s.location_id_country is null;

UPDATE cache_samples_functional u SET location_id_vice_county = ils15.location_id, location_id_lrc_boundary = ils1370.location_id, location_id_country = ils2188.location_id FROM ( SELECT s.id FROM samples s
JOIN smplist list on list.id=s.id ORDER BY s.id ) upd LEFT JOIN index_locations_samples ils15 on ils15.sample_id=upd.id and ils15.location_type_id=15 and ils15.contains=true LEFT JOIN index_locations_samples ils1370 on ils1370.sample_id=upd.id and ils1370.location_type_id=1370 and ils1370.contains=true LEFT JOIN index_locations_samples ils2188 on ils2188.sample_id=upd.id and ils2188.location_type_id=2188 and ils2188.contains=true WHERE u.id=upd.id;

UPDATE cache_occurrences_functional u SET location_id_vice_county = ils15.location_id, location_id_lrc_boundary = ils1370.location_id, location_id_country = ils2188.location_id FROM ( SELECT s.id FROM samples s
JOIN smplist list on list.id=s.id ORDER BY s.id ) upd LEFT JOIN index_locations_samples ils15 on ils15.sample_id=upd.id and ils15.location_type_id=15 and ils15.contains=true LEFT JOIN index_locations_samples ils1370 on ils1370.sample_id=upd.id and ils1370.location_type_id=1370 and ils1370.contains=true LEFT JOIN index_locations_samples ils2188 on ils2188.sample_id=upd.id and ils2188.location_type_id=2188 and ils2188.contains=true WHERE u.sample_id=upd.id;

johnvanbreda commented 8 years ago

Proposed resolutions:

Check why entering a UKBMS record causes the sample to be treated as updated as I don't think it needs to
Look into ordering the updates to prevent potential deadlocks (e.g. see http://stackoverflow.com/questions/27007196/avoiding-postgresql-deadlocks-when-performing-bulk-update-and-delete-operations).

kitenetter commented 8 years ago

Not sure this adds anything to what we already know, but for info I added some records this morning that should be linked to VC South Essex. Records up to ID 3638146 seem to have been linked successfully, but a set of records from 3638297 to 3638304 have not (yet) been linked to the VC.

The records were entered as a list at 09:31, but I had taken quite a long time to complete the list, as I was faffing around adding photos and trying to add associated species to the comments. I last checked using the VC filter in My Records at 12:15, and they were not being included for South Essex.

johnvanbreda commented 8 years ago

Scratch my previous thinking on this. The issue looks to be caused when the spatial index builder creates links to vice counties (and other indexed boundaries) but sets flags indicating that the boundary lies fully inside more than one boundary of a given type. E.g. the point might appear to lie fully inside 2 different vice counties. This could be a code error or a problem in the boundaries - now investigating. Effectively you end up with 1 update query that tries to update the same record twice, causing the deadlock.

johnvanbreda commented 8 years ago

Ok, I've got it. A while back we wanted to allow verification and other filters to be applied to either the whole of Yorkshire, or the whole of Sussex, combining the various county boundaries within them into a single "supercounty". So I created 2 extra vice county boundaries and these are now used by a number of verification and reporting filters.

The new cache structure code uses the fact that a record which lies inside 1 vice county should never exist inside another to enable a significant optimisation of the reporting queries. However the creation of Sussex and Yorkshire as vice counties breaks this rule and with it, breaks the code.

A proposed solution follows. Each location layer can be treated in one of 3 ways by the "spatial index builder" - the module responsible for pre-calculating the joins between samples and locations to speed up various reporting. It can:

ignore the layer. This is what we do for "My sites" since most boundaries will be fairly small and simple so not worth the overhead of running all the indexation.
index it, creating a join record in the index_locations_samples table so you can easily find which samples join to which location.
as above, but if the layer does not contain overlapping boundaries you can flag it as "unique" resulting in a foreign key field being added to the reporting cache tables joining them straight to the location. This is MUCH faster than the join through index_locations_samples but only works if there can only be 1 boundary linked to the sample.

So, the solution I think is to move the Yorkshire and Sussex boundaries into a different layer so they are no longer in the Vice Counties layer. Then they won't clash with the existing indexing. This means that we'll need to add this other layer to the various "Explore" pages to enable it for use, so they won't appear in the list of vice counties but you'll need to pick a different location type to choose from in order to find them.

We already have a layer called "Miscellaneous indexed boundaries" that currently just has a buffered version of Ireland in it plus a boundary of the Cairngorms NP. We could add the 2 supercounty layers to this, but as it is not declared as requiring no overlaps for the boundaries it contains we would be reduced to the slower way of filtering to either Yorkshire or Sussex records. Or, we could create a new "Combined counties" layer and declare that it requires no overlaps, maintaining current performance. If we do this it means we will have a new column in the cache tables on the server (and therefore a slight memory hit).

Martin - given this do you agree we should create a new layer called "Combined counties"?

JimBacon commented 8 years ago

Some good detective work there, John. Just to chip in, it seems there might be an alternative approach where you don't worry about a combined county layer, or the location boundaries for them, or indexing them. Instead you somewhere define that Yorkshire is composed of vice counties x, y and z and that any query against Yorkshire is replaced by a vice_county in [x,y,z].

kitenetter commented 8 years ago

Sounds like good progress. For combined VCs it makes sense to have a consistent way of dealing with them (and there could be further requests for other combinations in future). Jim's approach seems to me to be more flexible if it can be implemented, but if not then it sounds like a new "Combined counties" layer is needed.

It's likely that over time we will receive further requests to provide 'non-standard' boundaries for particular projects or surveys; it would be good to know whether the "Miscellaneous indexed boundaries" layer can cope with more, or whether we need to think of other solutions, or whether we need to discourage the use of non-standard boundaries unless absolutely necessary.

johnvanbreda commented 8 years ago

Thanks both. Good to have another eye on it as I've been staring at the problem all afternoon. In fact this makes me wonder if we should focus on implementing https://github.com/BiologicalRecordsCentre/iRecord/issues/3 then we can simply switch over the existing filters to use the list of contained vice counties. This means there would be no such location as "Yorkshire" or "Sussex" but you would be free to combine 2 or more into 1 filter. The filter would be available for general reporting, downloads, verification filters or activities. Does that sound sensible?

DavidRoy commented 8 years ago

Yes, that was going to be my suggestion

johnvanbreda commented 8 years ago

I've implemented BiologicalRecordsCentre/iRecord#3 now. I've also switched all the filters that currently use the Sussex boundary over to using a combination of East and West Sussex. However, before going any further we need to consider what will happen to people who have chosen Sussex or Yorkshire as their preferred recording Vice county in their user profile? If we remove these boundaries as suggested this might mess up things when they visit All Records and other explore pages.

johnvanbreda commented 8 years ago

Just for safekeeping - some scripts used to swap filters from Sussex over to the constituent vice counties: ` select * from locations where name='Sussex' and location_type_id=15 --16217 select * from locations where name='East Sussex' and location_type_id=15 -- 736 select * from locations where name='West Sussex' and location_type_id=15 -- 735

update filters set definition=replace(definition, '"indexed_location_id_context":"16217"', '"indexed_location_list_context":"735,736"') where definition like '%"indexed_location_id_context":"16217"%'; update filters set definition=replace(definition, '"indexed_location_id":"16217"', '"indexed_location_list":"735,736"') where definition like '%"indexed_location_id":"16217"%';

update filters set definition=replace(definition, '"imp-location":"16217",', '') where definition like '%"imp-location":"16217",%'; update filters set definition=replace(definition, '"imp-location_context":"16217",', '') where definition like '%"imp-location_context":"16217",%';

update filters set definition=replace(definition, '"remembered_location_name":"Sussex"', '"remembered_location_name":"East Sussex, West Sussex"') where definition like '%"remembered_location_name":"Sussex"%' and definition like '%"735,736"%'; `

kitenetter commented 8 years ago

I didn't realise that you could choose "Yorkshire" or "Sussex" as a vice-county, and presumably this explains why some records, when downloaded, have these incorrect names in the vice-county column. From the point of view of downloading data for a recording scheme it is enormously unhelpful to have non-existent vice-counties in the VC column! So I would prefer to see them got rid of, at least at the records level.

Do the user profiles allow for choosing multiple VCs, in which case can we replace "Sussex" with "East Sussex + West Sussex" in the profiles? If the profiles don't allow for multiples then I suppose we need to keep "Sussex", but I would very much want that boundary to be excluded from use when calculating the VC allocation of an individual record. Or could we give people the choice of vice-county or modern (LRC) boundary in their profiles, in which case we could switch the pseudo-VC "Sussex" to the Sussex LRC boundary?

johnvanbreda commented 8 years ago

I've added a separate issue (#187) which provides a quick fix for the crash in the spatial index builder. However we still need to resolve the issue of people who've picked Sussex or Yorkshire as their location, because they will now not be indexed in the cache__.location_id__ fields. Moving these 2 locations to a new location type and modifying the user profile to allow multiple location types to be selected from would fix the remaining issues.

Indicia-Team / warehouse

New spatial index columns in cache_* tables not always filled in #148