AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/

countryConservation and stateConservation fields populated incorrectly #769

Closed sat01a closed 8 months ago

sat01a commented 1 year ago

I would have thought that these were the lists that informed the population of the stateConservation and countryConservation fields (via pipelines indexing). I'm told we had the metadata wrong and that the lists need to be flagged isAuthoritative and isThreatened and would appreciate confirmation. Does the region need to be set as well?

Currently the values in countryConservation (for all occurrences) are:

| countryConservation | occ count |
|---|---|
| Not Listed | 6,408,974 |
| Vulnerable | 285,558 |
| Endangered | 193,209 |
| Critically Endangered | 23,657 |
| Extinct | 223 |

whereas the distinct values in the EPBC list are:

| status | Count |
|---|---|
| Vulnerable | 794 |
| Endangered | 740 |
| Critically Endangered | 311 |
| Extinct | 104 |
| Conservation Dependent | 8 |
| Extinct in the wild | 1 |
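
The mismatch between the two sets of values above can be made explicit with a set comparison. This is only an illustrative sketch: the function and variable names are hypothetical, and the values are copied from the two tables in this comment.

```python
# Distinct countryConservation values currently in the index (first table).
index_values = {
    "Not Listed", "Vulnerable", "Endangered", "Critically Endangered", "Extinct",
}

# Distinct status values in the EPBC list (second table).
epbc_values = {
    "Vulnerable", "Endangered", "Critically Endangered", "Extinct",
    "Conservation Dependent", "Extinct in the wild",
}


def compare_vocabularies(indexed, listed):
    """Return (values indexed but absent from the list,
    values in the list that never reach the index), both sorted."""
    return sorted(indexed - listed), sorted(listed - indexed)


unexpected, missing = compare_vocabularies(index_values, epbc_values)
print("indexed but not in list:", unexpected)
print("in list but not indexed:", missing)
```

Any value in the first result cannot have come from the list, and any value in the second is being lost somewhere between the list and the index.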

We're about to go to a conference and do a talk on how we index threatened species, so it would be good if we could get this working.

sat01a commented 1 year ago

Added to the board - cc @peggynewman

djtfmartin commented 1 year ago

I believe this is in part related to this bug: https://github.com/gbif/pipelines/issues/793

This is fixed in this release of pipelines: https://github.com/gbif/pipelines/releases/tag/pipelines-parent-2.11.12.3

sat01a commented 1 year ago

Thanks @djtfmartin , is the above version deployed to prod?

javier-molina commented 1 year ago

Only in databox and databox dev atm @sat01a

Databox/test Solr in data has been refreshed recently using pipelines 2.11.12.3. You might want to test if this bug still occurs in test.

See https://github.com/AtlasOfLivingAustralia/preingestion/issues/71 for more details about pipelines versions.

adam-collins commented 1 year ago

Only dr368 in the test environment contains countryConservation: Not Listed. Unsure why this is the case as it was processed and loaded after the species list dr656 was updated on lists prod and test. Investigating.

A comparison between test and production:

| | test | prod |
|---|---|---|
| data resources with any countryConservation value | 213 | 1 (dr368) |
| taxonConceptID with any countryConservation value | 9177 | 11948 |
| taxonConceptID with any countryConservation value other than "Not Listed" | 1930 | 514 |

peggynewman commented 1 year ago

Do we have a raw_ and processed value? I guess if Bionet (dr368) or any data resource is passing us that value I'd prefer that it only contained values from the EPBC list (ie dr656)

adam-collins commented 1 year ago

Putting this on hold until after production pipelines is upgraded since it may already be fixed.

adam-collins commented 1 year ago

Status and counts for the countryConservation facet. This is populated from the status value in the species lists, e.g.:

| Status | Count |
|---|---|
| Not Listed | 6920694 |
| Migratory | 1892737 |
| Vulnerable | 1051252 |
| Endangered | 731931 |
| Critically Endangered | 289599 |
| Priority 4: Rare, Near Threatened | 197764 |
| Priority 1: Poorly-known species | 150067 |
| Priority 2: Poorly-known species | 125983 |
| Priority 3: Poorly-known species | 103300 |
| Conservation Dependent | 98169 |
| Other Specially Protected | 56762 |
| Extinct | 7916 |
| Extinct in the wild | 144 |

raw_countryConservation appears to be empty. It should be populated from the sourceStatus value in the species lists.

raw_stateConservation appears to be correctly fetching the values in the sourceStatus column. e.g. https://biocache.ala.org.au/ws/occurrences/search?q=species_list_uid:dr653&facet=true&facets=raw_stateConservation&pageSize=0

stateConservation also appears to be working. https://biocache.ala.org.au/ws/occurrences/search?q=species_list_uid:dr653&facet=true&facets=stateConservation&pageSize=0
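
The facet checks above can be scripted so they are easy to repeat per list and per field. This is a rough sketch, not part of biocache-service: the helper names are hypothetical, and the parser assumes the search response carries a `facetResults` array whose entries have `fieldName` and a `fieldResult` list of `label`/`count` pairs.

```python
from urllib.parse import urlencode

# Same endpoint as the URLs quoted above.
BASE = "https://biocache.ala.org.au/ws/occurrences/search"


def facet_url(list_uid, facet):
    """Build a facet-only query (pageSize=0) for one species list and one field."""
    params = {
        "q": f"species_list_uid:{list_uid}",
        "facet": "true",
        "facets": facet,
        "pageSize": 0,
    }
    return f"{BASE}?{urlencode(params)}"


def facet_counts(response, facet):
    """Return {label: count} for the given facet.

    Assumes the facetResults / fieldResult response shape described in the
    lead-in; returns an empty dict if the facet is absent."""
    for result in response.get("facetResults", []):
        if result.get("fieldName") == facet:
            return {item["label"]: item["count"]
                    for item in result.get("fieldResult", [])}
    return {}


# To run against the live service (network access required):
#   import json, urllib.request
#   url = facet_url("dr653", "raw_stateConservation")
#   counts = facet_counts(json.load(urllib.request.urlopen(url)),
#                         "raw_stateConservation")
print(facet_url("dr653", "stateConservation"))
```

An empty result for raw_countryConservation on a list where countryConservation has counts would reproduce the gap described above.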

adam-collins commented 1 year ago

I suspect that the Priority* values in the countryConservation field are from https://lists.ala.org.au/speciesListItem/list/dr2201 which was last updated 14/6. It now contains the region Western Australia and I hope that the pipeline was run on a previous version of the list.

adam-collins commented 1 year ago

It may be necessary to update the status field of the conservation lists. status should contain a processed value.

adam-collins commented 1 year ago

I can confirm that raw_countryConservation is not populated by the pipeline. Is this required?

peggynewman commented 1 year ago

The conservation lists used to have a status and a sourceStatus value because we attempted to map values onto a vocabulary approximating the IUCN Red List, so retained the mapping in these fields. This is no longer required because most states have now worked that into their lists themselves. The field sourceStatus might still be in the lists but I think we should remove the field State conservation (unprocessed) from the facets. Note that these are not Darwin Core terms, they are in ALA namespace. Having said all this, BioNet provide us with stateConservation and countryConservation in their data, and provide this 'Not Listed' value, which is true for NSW, but misleading in our data. My suggestion is to retain any provided value as raw_stateConservation and raw_countryConservation, and the values shouldn't appear on the UI.

peggynewman commented 1 year ago

@adam-collins doh, sorry I just realised the issue around countryConservation. Really the only values in this should be coming from the EPBC list https://lists.ala.org.au/speciesListItem/list/dr656 which I thought was:

adam-collins commented 1 year ago

@peggynewman dr2201 was matching to Australia because it had 2 spaces between Western and Australia. Fixed on test and prod. Future matches will correctly match to the state instead of the country.
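
The matching logic in pipelines is not shown in this thread, so the following is only a sketch of how a region value could be normalised before it is compared, to guard against this kind of whitespace mismatch. The function names and the set of state names are hypothetical.

```python
# Hypothetical lookup of state/territory names; not taken from pipelines.
STATES = {
    "Australian Capital Territory", "New South Wales", "Northern Territory",
    "Queensland", "South Australia", "Tasmania", "Victoria", "Western Australia",
}


def normalise_region(name):
    """Trim the value and collapse any run of whitespace to a single space."""
    return " ".join(name.split())


def region_level(region):
    """Classify a list's region as 'state', 'country' or 'unknown'
    after normalising whitespace."""
    cleaned = normalise_region(region)
    if cleaned in STATES:
        return "state"
    if cleaned == "Australia":
        return "country"
    return "unknown"


# The dr2201 value had two spaces between the words.
print(region_level("Western  Australia"))
```

Normalising at comparison time means a stray space in list metadata cannot change which field a list populates.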

adam-collins commented 1 year ago

Moved the issue to https://github.com/gbif/pipelines/issues/920

peggynewman commented 1 year ago

Holy cow, good catch @adam-collins ping @rosemaryjoconnor

rosemaryjoconnor commented 1 year ago

@adam-collins oh wow excellent catch Adam. Thank you

peggynewman commented 1 year ago

@adam-collins this is still a problem in production - ie WA values are still appearing in the countryConservation field. A full ingest ran last night and I would have expected that would have resolved the problem. What do you think?

peggynewman commented 10 months ago

I want to close this issue and reopen another.

@adam-collins See here, a query on the Qld Conservation Statuses list (species_list_uid:dr652) is picking up a raw_conservationStatus of Representative record. Should I raise this in pipelines?

https://biocache.ala.org.au/occurrences/search?q=species_list_uid%3Adr652&qualityProfile=ALA&qc=-_nest_parent_%3A*&fq=raw_state_conservation%3A%22R%22#tab_mapView

Also weird: Near threatened values aren't getting through to biocache from the list at all.

adam-collins commented 9 months ago

An update to the requirements of this issue.

adam-collins commented 9 months ago

Pull request https://github.com/gbif/pipelines/pull/929

adam-collins commented 8 months ago

Merged into 2.18.0-SNAPSHOT

peggynewman commented 6 months ago

raw_stateConservation is now gone - I suggest we remove it from the facet list. stateConservation and countryConservation are looking great.

adam-collins commented 6 months ago

Removed raw_stateConservation in pull request https://github.com/AtlasOfLivingAustralia/ala-install/pull/774

adam-collins commented 5 months ago

Testing can be done with a data resource that has default values for stateConservation and countryConservation. After reingesting, SOLR should have these values in the raw_stateConservation and raw_countryConservation fields. Currently there are none in test, or it is not working.

peggynewman commented 1 month ago

All good. Please remove raw_stateConservation from the facet list.