Walk data not matching summary data - Githubissues

BiologicalRecordsCentre / UKBMS-online

Issue tracking for UKBMS online recording site

2 stars 0 forks source link

Walk data not matching summary data #291

Open IanMiddlebrook opened 1 year ago

IanMiddlebrook commented 1 year ago

A recorder for Lullington Heath (locationID=2068) noted Heath Fritillary in the list for 'species previously known at this site', which he says is an error.

On investigating, I found a record from 2010. Since this is certainly an error, I made an attempt to delete it.

On the Annual Summary page, the data for 22nd June 2010 presents exactly as expected, matching the data that were originally submitted via Transect Walker at the time (now uploaded to the website as part of the historic data set), including the Heath Fritillary.

However, when I went to the Walk Data Entry page (sample_id=11456633), the data presented were completely different, and no Heath Fritillary was indicated, so I was unable to delete it.

Two questions, how is it possible for the data in the sample to be so different from the summary? And presumably this won't be the only historic sample with the problem, so is it possible to identify others?

Thanks, Ian

Gary-van-Breda commented 8 months ago

@IanMiddlebrook : the problem originates in the data upload. Each walk should have a single subsample for each section: this upload has resulted a multitude of sub-samples for each section - e.g. there are 6 sub-samples for S13, which is where the Heath Fritillary resides. The Annual summary is correct, but the Walk DE page can only handle one sub-sample for each section. Yes - it should be possible to identify others (I'll do that), but certainly the same site's previous and subsequent week's walks are also problematical - so I'd assume at the very least all the occurrence data uploaded at the same time also has this problem.

The issue arises from performance optimisations, which means that the records must be grouped together by date/section in the upload file - it looks like this was not the case.

DavidRoy commented 8 months ago

@Gary-van-Breda, yes, this is a known problem. The importer expected all the sub-samples to be sequential in the import file and I don't think this was always the case with the files - hence why multiple sub-samples were created. can you suggest a way of correcting at the db end?

Gary-van-Breda commented 8 months ago

For info only: Running

select distinct walkid
from (SELECT p.id walkid, c.location_id, count(distinct c.id)
from samples p
JOIN samples c ON p.id = c.parent_id and c.deleted = 'f' and c.sample_method_id = 776
JOIN occurrences o ON c.id = o.sample_id and o.deleted = 'f'
WHERE p.survey_id=64
  AND p.deleted = 'f'
  AND p.parent_id is null
  and p.sample_method_id = 22
Group by p.id, c.location_id
Having count(distinct c.id) > 1) details

Gives approximately 168k walks with this multiple subsample issue.

Gary-van-Breda commented 8 months ago

@DavidRoy : I can do something at the DB end - it'll take a while to run, but should be do-able.

Gary-van-Breda commented 8 months ago

@IanMiddlebrook : I'd like to build this up to gain confidence in the solution. To that end, I've run my script against just sample_id=11456633 (Lullington Heath 22nd June 2010) - can you check it looks OK to you?

If that is OK, I'll do all of Lullington Heath for 2010; then all of Lullington Heath; then everything else.

Gary-van-Breda commented 8 months ago

For Info only

SELECT walk.id walk_id, subsample.location_id section_location_id, subsample.id, o.taxa_taxon_list_id, count(distinct o.id)
    from samples walk
    JOIN samples subsample
      ON walk.id = subsample.parent_id
      AND subsample.deleted = 'f'
      AND subsample.sample_method_id = 776
    JOIN occurrences o
      ON subsample.id = o.sample_id
      AND o.deleted = 'f'
    WHERE walk.survey_id=64
      AND walk.deleted = 'f'
      AND walk.parent_id is null
      and walk.sample_method_id = 22
-- AND walk.id = 11456633
-- AND walk.location_id = 2068
    Group by walk_id, section_location_id, subsample.id, o.taxa_taxon_list_id
    Having count(distinct o.id) > 1
    ORDER BY walk_id desc, section_location_id

reveals that there are 4156 incidents where a section has more than one occurrence record for a single species. This is obviously a smaller problem than the duplicate subsamples, but should be addressed as well.

Gary-van-Breda commented 8 months ago

For Info only

SELECT walk.id walk_id, subsample.location_id section_location_id, subsample.id, o.taxa_taxon_list_id, count(distinct o.id)
    from samples walk
    JOIN samples subsample
      ON walk.id = subsample.parent_id
      AND subsample.deleted = 'f'
      AND subsample.sample_method_id = 776
    JOIN occurrences o
      ON subsample.id = o.sample_id
      AND o.deleted = 'f'
    WHERE walk.survey_id=64
      AND walk.deleted = 'f'
      AND walk.parent_id is null
      and walk.sample_method_id = 22
-- AND walk.id = 11456633
-- AND walk.location_id = 2068
    Group by walk_id, section_location_id, subsample.id, o.taxa_taxon_list_id
    Having count(distinct o.id) > 1
    ORDER BY walk_id desc, section_location_id

reveals that there are 4156 incidents where a section has more than one occurrence record for a single species. This is obviously a smaller problem than the duplicate subsamples, but should be addressed as well.

IanMiddlebrook commented 8 months ago

Hi @Gary-van-Breda 22-6-2010 for Lullington Heath looks good to me.