BiologicalRecordsCentre / UKBMS-online

Issue tracking for UKBMS online recording site

data missing from Occurrences download #297

Closed IanMiddlebrook closed 1 year ago

IanMiddlebrook commented 1 year ago

Hi @DavidRoy @Gary-van-Breda

There seems to be a fundamental issue with the Occurrences download on the Annual Summary page: it appears to be incomplete.

The Section Level and Occurrences downloads should have exactly the same number of records, but data are consistently missing from the latter.

For example, downloading all the WCBS-BBS (butterfly-only) data for 2022 gives 11,902 records for Section Level, but only 11,728 records for Occurrences. What's missing is all the records for a handful of sites:

- SP9931
- SS5327 Newton Tracy
- SS5329 St Johns Chapel
- IH2528
- TG1908

If I try to download Occurrences for any of these sites individually - the download is blank.

The same is true of the WCBS-BC data (300 records missing between Section Level and Occurrences) and Transect data (3,400 records missing).

This is absolutely critical, as I use the Samples and Occurrences downloads to review and collate the year's data before passing to UKCEH for analysis.

Thanks in advance, Ian

DavidRoy commented 1 year ago

@JimBacon can you pick this up?

JimBacon commented 1 year ago

Looking at the settings for the Annual Summary page, I can see that the following reports are used to extract data.

The section level report extracts records from the samples and occurrences tables. The occurrences report extracts records from the samples and cache_occurrences_functional tables.

Looking initially at the example site, IH2528, I found that the occurrences for this site are not present in the cache table. This is one explanation for why an inconsistency is observed.

In the case of site SP9931, the issue is that the cache_occurrences_functional.location_id field is null, rather than containing the id of the section.

To test whether these two things explain all the inconsistencies across all sites and surveys, I ran the following query.

SELECT 
    surveys.title,
    locations.name,
    count(o.id) filter (where cof.id is null) null_occurrence,
    count(o.id) filter (where cof.id is not null and cof.location_id is null) null_location
FROM samples site
JOIN surveys 
    ON surveys.id = site.survey_id 
JOIN locations 
    ON locations.id = site.location_id
JOIN samples section
    ON section.parent_id = site.id
    AND section.deleted = false
JOIN occurrences o
    ON o.sample_id = section.id
    AND o.deleted = false
JOIN cache_taxa_taxon_lists taxa
    ON taxa.id = o.taxa_taxon_list_id
LEFT JOIN cache_occurrences_functional cof
    ON cof.id = o.id
WHERE surveys.website_id = 27 -- UKBMS
    AND site.date_start >= '2022-01-01' 
    AND site.date_end <= '2022-12-31'
    AND site.parent_id is null
    AND site.deleted = false
    AND taxa.taxon_group = 'insect - butterfly'
GROUP BY surveys.title, locations.name
HAVING 
    count(o.id) filter (where cof.id is null) > 0 
    OR count(o.id) filter (where cof.id is not null and cof.location_id is null) > 0
ORDER BY surveys.title, locations.name

The summary of results from the query is

These values are consistent with the figures Ian gives above.

I haven't any clues as to how these errors have arisen but I can go ahead and resolve them by correcting the cache table.
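As a minimal sketch of the sort of correction involved, the missing rows could be rebuilt by queuing them through the standard Indicia work_queue mechanism (the same mechanism used in the update queries further down this thread). This is illustrative only, not necessarily the exact fix applied:

```sql
-- Sketch only: queue a cache rebuild for every live occurrence that has
-- no row in cache_occurrences_functional. The cache builder then
-- recreates the missing rows on its next run.
INSERT INTO work_queue(task, entity, record_id, params, cost_estimate, priority, created_on)
SELECT 'task_cache_builder_update', 'occurrence', o.id, null, 100, 2, now()
FROM occurrences o
LEFT JOIN cache_occurrences_functional cof ON cof.id = o.id
WHERE o.deleted = false
  AND cof.id IS NULL;
```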

IanMiddlebrook commented 1 year ago

That's great - thanks for investigating

JimBacon commented 1 year ago

Hi @IanMiddlebrook

I've looked deeper into the causes of the discrepancy you have observed and find that, in most cases, it is a result of records being marked as confidential. One of the reports returns those records and one does not.

Some sites are declared confidential, presumably because you don't wish to reveal their location. Given this, should they be included in these downloads or not?

IanMiddlebrook commented 1 year ago

Hi @JimBacon

Yes, all these records should be included in the downloads. Although some Sites are marked as confidential, this is with regard to public visibility. Any User linked to these Sites should still be able to download the data in full.

Thanks, Ian

JimBacon commented 1 year ago

I have set up a revised report for occurrence download which now includes records from confidential sites.

The other cause of missing records turns out to relate to deleted user accounts. One report returns records from users whose accounts have been deleted, while the other does not. It would be normal to retain records when a user deletes their account.

I need to understand this better to determine how to fix it.

IanMiddlebrook commented 1 year ago

Thanks @JimBacon. It's strange that data from these sites have been excluded from the 'Occurrences' download when they appear in the 'Samples' download. These two downloads should act together as a database, linked by Sample_ID, so should include all data from the same set of Sample_IDs.

JimBacon commented 1 year ago

Crudely speaking, one download says 'get all records' and the other says 'get all records except confidential ones', so the behaviour makes sense. However, the complexity of Indicia means that this is far from obvious.
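In effect, the difference amounts to something like the following. This is a simplified illustration, not the actual report SQL, and it assumes the flag is exposed on the cache table as a `confidential` column:

```sql
-- Simplified illustration of the difference between the two reports.
-- Section level report: returns every record.
SELECT o.id FROM cache_occurrences_functional o;

-- Occurrences report: silently drops confidential records.
SELECT o.id FROM cache_occurrences_functional o
WHERE o.confidential = false;
```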

JimBacon commented 1 year ago

Looking at the WCBS-BBS survey as an example, the records missing from the cache were created by three different users who are now marked as deleted.

It is evident, on examining the relevant users and people records, that the deletions were the consequence of merging duplicate accounts for the same individuals.

The occurrences missing from the cache were created after the user deletions. Examining the settings on the ukbms.org site confirms that the three people are still linked to their deleted warehouse accounts.
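A hypothetical diagnostic along these lines (assuming the warehouse users table carries a deleted flag, as the other tables in this thread do) would surface records still being created against deleted accounts:

```sql
-- Hypothetical sketch: count live occurrences created by users whose
-- warehouse account is marked as deleted.
SELECT u.id AS user_id, count(o.id) AS occurrence_count
FROM occurrences o
JOIN users u ON u.id = o.created_by_id
WHERE u.deleted = true
  AND o.deleted = false
GROUP BY u.id
ORDER BY occurrence_count DESC;
```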

I've raised an issue, https://github.com/Indicia-Team/drupal-8-modules-indicia_features/issues/5, in relation to this problem. The tasks to perform are

JimBacon commented 1 year ago

The following accounts were automatically updated by masquerading as each user and allowing the easy login module to attach a valid warehouse user id.

| Drupal user id | Old warehouse id | Updated warehouse id |
| --- | --- | --- |
| 6167 | 768 | 340388 |
| 3995 | 71365 | 144574 |
| 5458 | 128101 | 161921 |
| 208 | 18910 | 177 |
| 26507 | 187580 | 130981 |
| 2524 | 79580 | 938 |

Samples and occurrences were updated by queries like the following, copied from the behaviour of the mergeUsers function in modules/indicia_svc_security/helpers/user_identifier.php

update occurrences set created_by_id = 340388 where created_by_id = 768;
update occurrences set updated_by_id = 340388 where updated_by_id = 768;
update samples set created_by_id = 340388 where created_by_id = 768;
update samples set updated_by_id = 340388 where updated_by_id = 768;
update cache_occurrences_functional set created_by_id = 340388 where created_by_id = 768;
update cache_samples_functional set created_by_id = 340388 where created_by_id = 768;

The cache tables were updated by queries like the following

insert into work_queue(task, entity, record_id, params, cost_estimate, priority, created_on)
select 'task_cache_builder_update', 'sample', id, null, 100, 2, now()
from samples where created_by_id = 340388 order by id;

insert into work_queue(task, entity, record_id, params, cost_estimate, priority, created_on)
select 'task_cache_builder_update', 'occurrence', id, null, 100, 2, now()
from occurrences where created_by_id = 340388 order by id;

Two further Drupal accounts, listed below, appear to have been taken over by new people but have still been recording against the previous user id, now deleted. The new users have been emailed to ask when they took over to try to clarify which user records should be attributed to.

| Drupal user id | Old warehouse id |
| --- | --- |
| 635 | 19070 |
| 763 | 20649 |

JimBacon commented 1 year ago

Having heard back from a recorder, I updated Drupal user id 635 from warehouse user id 19070 to 137200.

Records made by warehouse user 19070 at Blakeney Friary Hills (transect location id 36495) were reassigned to user 137200 using the following queries

create temporary table if not exists to_update as
    select t.id transect_id, 
        s.id section_id, 
        o.id occurrence_id
    from samples t
    join samples s 
        on s.parent_id = t.id
    join occurrences o 
        on o.sample_id = s.id
    where t.location_id = 36495 -- Blakeney Friary Hills
        and o.created_by_id = 19070;

update samples set created_by_id = 137200 
    where id in (select distinct transect_id from to_update);
update cache_samples_functional set created_by_id = 137200 
    where id in (select distinct transect_id from to_update);
insert into work_queue(record_id, task, entity, params, cost_estimate, priority, created_on)
select distinct on (transect_id)  
    transect_id, 'task_cache_builder_update', 'sample', null, 100, 2, now()
from to_update order by transect_id;

update samples set created_by_id = 137200 
    where id in (select distinct section_id from to_update);
update cache_samples_functional set created_by_id = 137200 
    where id in (select distinct section_id from to_update);
insert into work_queue(record_id, task, entity, params, cost_estimate, priority, created_on)
select distinct on (section_id)  
    section_id, 'task_cache_builder_update', 'sample', null, 100, 2, now()
from to_update order by section_id;

update occurrences set created_by_id = 137200 
    where id in (select occurrence_id from to_update);
update cache_occurrences_functional set created_by_id = 137200 
    where id in (select occurrence_id from to_update);
insert into work_queue(record_id, task, entity, params, cost_estimate, priority, created_on)
select occurrence_id, 'task_cache_builder_update', 'occurrence', null, 100, 2, now()
from to_update order by occurrence_id;

update samples set updated_by_id = 137200 
    where id in (select distinct transect_id from to_update)
    and updated_by_id = 19070;
update samples set updated_by_id = 137200 
    where id in (select distinct section_id from to_update)
    and updated_by_id = 19070;
update occurrences set updated_by_id = 137200 
    where id in (select occurrence_id from to_update)
    and updated_by_id = 19070;

JimBacon commented 1 year ago

Realising that sightings from multiple recorders are submitted via the Drupal user account 635, it makes more sense to attribute all the records submitted by the deleted warehouse user 19070 to the current user of that account than to a valid account of the original user. To do that, the above queries were repeated for the transect at Blakeney Point (transect location id 3330).

IanMiddlebrook commented 1 year ago

Hi @JimBacon, thanks for all your work on this. I've been able to download the WCBS Occurrences and confirm they have the same number of records as the Section Level download, so I can progress with these. @DavidRoy - I've been trying all week to download the transect Occurrences without success, timing out again and again. I can't process the data for Marc without this download.

Thanks, Ian

JimBacon commented 1 year ago

@IanMiddlebrook Some progress then! The only errors I am still aware of are for a handful of records made in 2019 or before. I, too, have tried and failed to do the download for the transect surveys. @DavidRoy , do we need to try adding an ElasticSearch download page? My impression is they perform better, albeit there are sometimes other kinds of inconsistencies to iron out.

DavidRoy commented 1 year ago

@JimBacon given the urgency and the problem @IanMiddlebrook has had with the existing download, can you run this query directly on the db?

We have elasticsearch equivalent downloads for the European BMS so it would be good to switch across

JimBacon commented 1 year ago

I have tried three times and the server responded with an 'Out of Memory' error each time. I'll try some more economical queries tomorrow and perhaps ElasticSearch too. What is the minimum set of columns you need in the download, @IanMiddlebrook?

IanMiddlebrook commented 1 year ago

Hi @JimBacon

I expect you realise this, but just to clarify, it's only the 2022 data I need to process right now.

Thanks, Ian

JimBacon commented 1 year ago

Yes, I got that. There are about 500k rows in 2022. The occurrence download looks like the following. Can we remove any of the columns: Site Type, for example, or the common name?

| Sample ID | Occurrence ID | Site Type | Site Code | Date | Section No. | Species | Preferred |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 20525149 | 28519094 | Transect | 2073 | 2022-04-01 | S4 | Peacock | Inachis io |
| 18155362 | 24952814 | Transect | 2762 | 2022-04-01 | S1 | Small Tortoiseshell | Aglais urticae |

IanMiddlebrook commented 1 year ago

Hi @JimBacon For my purposes, you can remove 'Site Type' and 'Preferred'

JimBacon commented 1 year ago

For the record, I used the following query, running it a month at a time. Even July, with 194k records, returned with reasonable speed.

SELECT 
    o.parent_sample_id as sample_id,
    o.id as occurrence_id,
    l.code as site_code,
    o.date_start as date,
    l2.code as section_name,
    coalesce(cttl.default_common_name, cttl.preferred_taxon) as common
FROM cache_occurrences_functional o
JOIN samples parent ON parent.id = o.parent_sample_id
JOIN samples child ON child.id = o.sample_id
JOIN locations l ON parent.location_id=l.id
JOIN locations l2 ON child.location_id=l2.id
JOIN cache_taxa_taxon_lists cttl ON cttl.id=o.taxa_taxon_list_id
WHERE 
    o.survey_id=64
    AND o.date_start >= CAST('2022-01-01' as date)
    AND o.date_start <= CAST('2022-01-31' as date)
ORDER BY l.code, o.date_start, l2.code, cttl.taxonomic_sort_order, o.id DESC

Results sent by email.

DavidRoy commented 1 year ago

Thanks @JimBacon - much appreciated. Let's evaluate options when we've migrated to the new UKCEH infrastructure.

IanMiddlebrook commented 1 year ago

Thanks @JimBacon

IanMiddlebrook commented 1 year ago

Sorry @JimBacon, I've just looked at the download and there's a critical field missing: Abundance Count. Apologies, I didn't spot that omission in your earlier post; I was only looking for fields that could be removed.

Could you possibly redo the export with the Abundance Count included?

Many thanks, Ian
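For reference, the per-month query earlier in the thread could be extended along these lines. This is a sketch only: it assumes the count is stored in occurrence_attribute_values, and the attribute id used is a placeholder rather than the real one.

```sql
-- Sketch: add an abundance count column to the per-month download query.
-- 16 is a placeholder occurrence_attribute_id, not the actual id.
SELECT
    o.parent_sample_id AS sample_id,
    o.id AS occurrence_id,
    oav.int_value AS abundance_count
FROM cache_occurrences_functional o
LEFT JOIN occurrence_attribute_values oav
    ON oav.occurrence_id = o.id
    AND oav.occurrence_attribute_id = 16  -- placeholder id
    AND oav.deleted = false
WHERE o.survey_id = 64
    AND o.date_start >= '2022-01-01'
    AND o.date_start <= '2022-01-31';
```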

JimBacon commented 1 year ago

I think this issue can be closed but I don't appear to have permissions to do that.

DavidRoy commented 1 year ago

@IanMiddlebrook to close if happy?