BiologicalRecordsCentre / iRecord

Repository to store and track enhancements, issues and tasks regarding the iRecord website.
http://irecord.org.uk
2 stars 1 forks source link

NBN Record Cleaner phenology ruleset to use stage information #282

Open DavidRoy opened 6 years ago

DavidRoy commented 6 years ago

Requested by some groups to reflect life-stage in rule-checks. For example, recording of butterfly eggs over the winter should not trigger a notification of 'out of flight season'

Include within work for ruleset update

224

johnvanbreda commented 6 years ago

There is already a feature to support stages in the PeriodWithinYear rules. However I will update #224 with some notes I've made on integrating stage considerations with the proposed changes to make the rule application "intelligent". Here's an example: http://warehouse1.indicia.org.uk/index.php/verification_rule/edit/4108

kitenetter commented 4 years ago

Although this issue was previously closed, we haven't implemented ID rules based on life stages, and I don't think our phenology rules take life stages into account. I'm not sure if we have an approach ready but just not yet implemented, or if further development is needed to work out an approach.

@johnvanbreda could you confirm whether we have yet incorporated any life-stage information into the phenology and ID difficulty rules, and how large-scale this issue will be to tackle?

johnvanbreda commented 4 years ago

Yes, we have always had this implemented in the code as we originally tried to exactly match the Record Cleaner specification. To work, the stage has to be in a custom occurrence attribute with system function 'stage' or 'sex_stage' and must exactly match the stage in the rule (though variations in case are accepted).

kitenetter commented 4 years ago

When you say "must exactly match the stage in the rule" where are those stages defined? I cannot see anything in the verification rules tables that refers to a stage.

johnvanbreda commented 3 years ago

@kitenetter the data structure used to capture the rule information is designed to be generic so it can handle the key/value pairs of metadata that define the different types of rules flexibly. That's why you can't find a field named Stage or similar. If you look in the output of the following query you'll see lots of rules have stage information:

select * from verification_rule_data where key='Stage'

Here's an example: http://warehouse1.indicia.org.uk/index.php/verification_rule/edit/5517. Unfortunately, in this case and in many others, the date range is applied to the stage limited "Other data" section, but is also applied to the non-stage limited "Metadata" section. This means that records of Epistrophe eligans will pass the rule if an adult male or female are within the date range given, OR if the record falls within the same date range no matter what the life stage. So the non-stage limited part of the rule cancels out the stage information which seems to be a mistake. If you look at the original rule file this was imported from it does seem to be the case that the error came from the rule file, not the Indicia import:

[Metadata]
TestType = PeriodWithinYear
Group = Hoverflies
ShortName=Epistrophe eligans: check date is during flight period
Description=Check date is during the known flight period for Epistrophe eligans
tvk = NBNSYS0000006912
startDate = 0206
endDate = 1107
ErrorMsg = Date is outside flight period for Epistrophe eligans
LastChanged = 20090615
[EndMetadata]

[Data]
Stage = All
startDate = 0206
endDate = 1107

Stage=Adult,Adult Male,Adult Female,Male,Female
startDate = 0206
endDate = 1107

In fact, not only is there a non-stage limited date range given, there is yet another given for stage "All". Other examples I found seemed to have similar problems.

JimBacon commented 3 years ago

I'm stumbling over numerous problems looking at this.

Firstly, Record Cleaner appears to be broken when it comes to PeriodWithinYear rules having stage information. Neither Les Hill nor I can make it work. It produces an error "Run PeriodWithinYear - Index was outside the bounds of the array." Unfortunately there are no further details to help determine whether it is a fault in the code, in the rule file, or the species data. I've tried numerous permutations which lead me to suspect it is the code. I've tried with the examples of rules given above, for Epistrophe eligans and Agraecina striata, as well as my own.

The reason this is relevant to Indicia is that I wanted to confirm the file format. The rule files for both the above species attempt to give dates for two different stages but, querying the Indicia database only returns a rule for one stage. This suggests that there is a bug in the rule importer if it is correct to have multiple stages in a rule file.

As an aside, I also note that the hoverfly rule above is incorrectly formed for Record Cleaner, missing the DataFieldName setting from the metadata section.

With regard to supply dates in the metadata section and the data section, the specification says "Both the date in the Metadata and in the Data section are tested if supplied." I'd expect a rule might look like the following.

[Metadata]
...
startDate = 0301
endDate = 0831
[EndMetadata]

[Data]
Stage = Larval
startDate = 0301
endDate = 0630

Stage = Adult
startDate = 0601
endDate = 0831

The implementation should require that the general test AND the stage test should both pass. I don't know what the outcome should be if the stage provided with the record does not match any stage in the rules.

johnvanbreda commented 3 years ago

@JimBacon the fact that Record Cleaner is broken when PeriodWithinYear rules have stage information explains why the stage data I've seen in the rule files seems inconsistent - it's probably never been tested.

I think if the stage does not match any stage provided in the rules, then we can only test the record date against the dates in the metadata section. This will be the case for many records which don't have stage data as well, or where the stage term is not a match for the term used in the rule file.

What I'm slightly less sure about is whether the way Indicia handles rule files where there are multiple stage/date rules in the [Data] section. We don't assume the order is important in the [Data] section so linking the date range to the correct Stage entry will need some work. Are you aware if there are actually any rules formatted with multiple stage sections in the [Data]?

JimBacon commented 3 years ago

It may not be quite what was envisaged for stages, but the Spider Recording Scheme gives multiple entries in the [Data] section to provide different date ranges for adult males and adult females. E.g.

[Metadata]
TestType=PeriodWithinYear
Group=Spiders Identifiable Season
ShortName=Helophora insignis: check date is during identifiable season
Description=Check date is during known identifiable season for Helophora insignis
tvk=NBNSYS0000009229
ErrorMsg=Date is outside the normal season for adult or identifiable Helophora insignis
DataFieldName=Stage
StartDate=0617
EndDate=1231
LastChanged=20120311
[EndMetadata]

[Data]
Stage=Adult Male,Male
StartDate=0724
EndDate=1130

Stage=Adult Female,Female
StartDate=0718
EndDate=1130

You can see this rule at http://warehouse1.indicia.org.uk/index.php/verification_rule/edit/4357 and observe how the "Other Data" has not been imported correctly, only showing the female stage.

johnvanbreda commented 3 years ago

@JimBacon there is a field in verification_rule_data called data_group, with the following description:

If a data section includes several groups of related data values, then identifies the group. E.g. if a period in year rule data includes several stage and start/end date pairs, this value groups them together.

This makes me think that the rule format in this example should be supported, but something is wrong with the importer so it doesn't work. Also I can't see any reference in the PeriodWithinYear code to the data_group field so that will need to be checked as well. Are you OK for me to fix these things in the code, or do you want to take a look yourself Jim?

JimBacon commented 3 years ago

OK by me.

johnvanbreda commented 3 years ago

@JimBacon done in the develop branch - here's the commit: https://github.com/Indicia-Team/warehouse/commit/15da079361fcd783f210e69d34c10e27ef782c40

johnvanbreda commented 2 years ago

@JimBacon this code is now live, but do we need to re-import some of the rule data to overwrite the incorrectly imported ones?

JimBacon commented 2 years ago

Yes, although I can't imagine we will see much difference after doing so. I am just completing a fix for https://github.com/Indicia-Team/warehouse/issues/390 which also necessitates a re-import so it would be sensible to wait until that fix is in place.

JimBacon commented 2 years ago

Also need to fix https://github.com/Indicia-Team/warehouse/issues/395 before re-import.

JimBacon commented 2 years ago

Above two fixes are now released and installed on the servers. I will try a re-import on the dev-warehouse and confirm it is working as intended.

JimBacon commented 2 years ago

I re-imported the Spider Recording Scheme and the Hoverfly Recording Scheme rules on the dev warehouse as they are known to have stage information in their period_within_year rules.

The Hoverfly records in the verification* tables updated as expected. The Spider rules created a new set of records in the verification* tables as the source_url and source_filename had changed, with the result that it was not recognised as an update.

The old Spider rules were removed with

update verification_rules
set deleted = true
where source_url like '%SRS%'
and updated_on < '2013-01-01';

update verification_rule_metadata vrm
set deleted = true
from verification_rules vr 
where vr.id = vrm.verification_rule_id
and  vr.source_url like '%SRS%'
and vrm.updated_on < '2013-01-01';

update verification_rule_data vrd
set deleted = true
from verification_rules vr 
where vr.id = vrd.verification_rule_id
and vr.source_url like '%SRS%'
and vrd.updated_on < '2013-01-01';

delete from cache_verification_rules_period_within_year cv
using verification_rules vr
where vr.id = cv.verification_rule_id
and vr.deleted = true;

delete from cache_verification_rules_identification_difficulty cv
using verification_rules vr
where vr.id = cv.verification_rule_id
and vr.deleted = true;

delete from cache_verification_rules_without_polygon cv
using verification_rules vr
where vr.id = cv.verification_rule_id
and vr.deleted = true;

The cache_verification_rules_period_within_year table is not updated for either recording scheme. In fact it is possible none of the cacheverification* tables are being updated. https://github.com/Indicia-Team/warehouse/issues/397

JimBacon commented 2 years ago

With the above bug fixed, I've re-imported the spider and hoverfly rules on the dev-warehouse and all looks well. I've submitted records via the dev-irecord site and seen them checked and flagged as expected.

Note that, for this to work, the records must be gathered with a stage attribute or a sex-stage attribute which shares the vocabulary used in the rule definition. E.g. The spider recording scheme has rules for stages 'Adult Male' and 'Adult Female' so a spider recording form should contain a sex-stage attribute offering those choices.

kitenetter commented 2 years ago

@JimBacon I'm not sure where this has got to - is there more work to do to enable the use of the stage terms for applying rules?

JimBacon commented 2 years ago

I believe the code on the warehouse would now support stage-related date ranges for PeriodWithinYear rule checks.

This issue is asking us to create rules which include life stage, notably for butterflies, and I have had fairly lengthy correspondence with Les Hill about doing this. There were repercussions which were not immediately obvious.

One positive outcome of the work was that I did finally work out how to install the latest version of RecordCleaner and how to format rules and record files so that it would correctly apply rules with stages.

kitenetter commented 2 years ago

Not sure if I'm going over ground that @JimBacon and Les have already discussed, but to me it seems that the optimal behaviour would be:

I would want us to avoid putting green ticks against records that have no applicable rules, see #746

Realise that this isn't straightforward to achieve!

JimBacon commented 2 years ago

To date, we have constrained ourselves to replicating the behaviour of RecordCleaner and following the specification in https://data.nbn.org.uk/recordcleaner/documentation/NBNRecordCleanerRuleGuide.pdf as best we can understand it. I think your suggestion would take us away from that.

kitenetter commented 1 year ago

Agreed to revisit later in 2022 and to link with any NBN plans for a refresh of the Record Cleaner software.

JimBacon commented 1 year ago

I have reviewed the comments from 11th May above and they continue to describe the current status.