AtlasOfLivingAustralia / DataQuality

Data Quality

Assess Data Quality #40

Open M-Nicholls opened 4 years ago

M-Nicholls commented 4 years ago

Tests and assertions that provide information as to the "quality" of the data

Implement the tests and assertions developed by the TDWG/GBIF working group.

M-Nicholls commented 4 years ago

Review existing tests and assertions. For example, the invalid institutionCode and invalid collectionCode assertions are not appropriate for NSW BioNet data. The check is probably doing a lookup against the collections and institutions in the collectory.

e.g. https://biocache.ala.org.au/occurrences/38b7d97c-4d75-4de4-9064-f90edbed9b32 https://biocache.ala.org.au/occurrences/763dad83-3eb6-464d-9068-c0907e4927e7

institutionCode = The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. Current example: ‘NSW Office of Environment and Heritage’

ownerInstitutionCode = The name (or acronym) in use by the institution having ownership of the object(s) or information referred to in the record. Current examples: ‘Office of Environment and Heritage’, ‘Birds Australia’, ‘Australian Museum’

The test result is also a double negative: a fail for "code not recognised".

M-Nicholls commented 4 years ago

related infrastructure issue: https://github.com/AtlasOfLivingAustralia/la-pipelines/issues/13

M-Nicholls commented 4 years ago

This GeospatialKosher issue has been resolved in https://github.com/AtlasOfLivingAustralia/biocache-store/issues/375 (fix GeospatialKosher):

APPENDIX 2: Definition of “Spatially valid”

We don't have an easily accessible definition of "Spatially valid", so I figured this needed to be addressed and amended. The definitions of the tests that the ALA runs can be found at https://biocache-ws.ala.org.au/ws/assertions/codes. The current formal definition of "Spatially valid" is based on the tests with isFatal: true and code < 10000, that is, the following tests/flags:

  1. Supplied coordinates are zero
  2. Suspected outlier
  3. Unable to convert UTM coordinates
  4. Zero latitude
  5. Unparseable verbatim coordinates
  6. Coordinates centre of country
  7. Geospatial issue ?? No idea what this means
  8. Coordinates are out of range for species ?? No idea what this means
  9. Outside expert range for species
  10. Supplied coordinates centre of state
  11. Decimal latitude/longitude conversion failed
  12. Zero longitude
  13. Habitat incorrect for species (presume marine species on land or vice-versa)

I would not have included 2, 6, 8, 9 or 10 as FATAL.
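The rule above can be sketched as a filter over the assertion-code listing. This is only a sketch: it assumes each entry is a dict with `code`, `name` and `isFatal` keys, and the sample entries (names and numeric codes) are made up for illustration, not taken from the assertions/codes service.

```python
def spatially_invalidating(assertion_codes):
    """Names of the tests selected by the rule: isFatal is true and code < 10000."""
    return [a["name"] for a in assertion_codes
            if a.get("isFatal") and a["code"] < 10000]


def is_spatially_valid(record_assertions, assertion_codes):
    """A record is treated as spatially valid if it carries none of the selected tests."""
    fatal = set(spatially_invalidating(assertion_codes))
    return fatal.isdisjoint(record_assertions)


# Hypothetical entries, for illustration only.
SAMPLE_CODES = [
    {"code": 1, "name": "hypotheticalZeroCoordinates", "isFatal": True},
    {"code": 2, "name": "hypotheticalWarningOnly", "isFatal": False},
    {"code": 20001, "name": "hypotheticalNonSpatialFatal", "isFatal": True},
]
```

Dropping tests 2, 6, 8, 9 and 10 from the fatal set would then only mean changing their `isFatal` value in the listing, not the rule itself.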

Tasilee commented 4 years ago

See also https://github.com/AtlasOfLivingAustralia/biocache-store/issues/375 for a parallel discussion.

M-Nicholls commented 4 years ago

• Sensitive and generalised records
  o Is this interacting with the filters – habitat and coordinate uncertainty?
    - Is the coordinate uncertainty taken into account in the habitat mismatch test?
    - Is generalisation of records moving their habitat and so they fail the habitat mismatch test?
  o Data points on the coastline can get flagged as wrong habitat.

M-Nicholls commented 4 years ago

Is "coordinates centre of country" working as an exclusion filter? See the record where there appears to be an NT observation of an SA/Vic species: https://biocache-dq-test.ala.org.au/occurrences/search?taxa=Caladenia+concolor#tab_mapView - it should have failed the tests applied by the filters.

https://biocache.ala.org.au/occurrences/search?q=*:*&fq=assertions%3A%22coordinatesCentreOfCountry%22#tab_mapView

is not working (possibly too precise)

Tasilee commented 4 years ago

For https://github.com/AtlasOfLivingAustralia/DataQuality/issues/40#issuecomment-630473632, TG2 decided not to incorporate dwc:coordinateUncertaintyInMetres and dwc:coordinatePrecision but we have agreed that spatial buffers are required.

Is the coordinate uncertainty taken into account in the habitat mismatch test?

Suggest no but buffer yes.

Is generalisation of records moving their habitat and so they fail the habitat mismatch test?

Probably, but it is a warning, so some false positives must be anticipated. This needs to be documented so that, if the test is applied, users can determine its applicability to their work.

Data points on the coastline can get flagged as wrong habitat.

A buffer will reduce the false positives.

This applies to centre of country/state/territory: use buffer.
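For the centre-of-country/state/territory case, a buffer could be sketched as a distance test instead of an exact coordinate match. This is an illustration only: the function names are hypothetical, and the centre point and buffer size passed in are for the caller to choose. The habitat case would need coastline geometry and is not covered here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_centre(lat, lon, centre_lat, centre_lon, buffer_km):
    """True if the point lies within buffer_km of the reference centre point."""
    return haversine_km(lat, lon, centre_lat, centre_lon) <= buffer_km
```

A record would then be flagged when `near_centre(...)` is true, so coordinates that are close to, but not exactly at, the centre are still caught.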

elywallis commented 4 years ago

My two cents is that having a buffer would certainly improve things but having a reference list that allows a species to be both marine and terrestrial might help even more - plenty of examples of species that can be found in both environments (penguins, seals, crocodiles, most shore birds etc)

Do we have (or could we do) any analysis of which species are most commonly flagged as 'out of their habitat' or as habitat mismatch so we can see what the spread of species actually is? If someone can generate the list I'd be happy to do the analysis
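A rough sketch of both ideas: a reference list that lets a species have more than one habitat, and a tally of which species are flagged most often. The habitat entries below are illustrative placeholders, not an ALA reference list, and the function names are hypothetical.

```python
from collections import Counter

# Illustrative reference list: a species may be allowed in more than one habitat.
HABITATS = {
    "Eudyptula minor": {"marine", "terrestrial"},
    "Crocodylus porosus": {"marine", "terrestrial"},
    "Caladenia concolor": {"terrestrial"},
}


def habitat_mismatch(species, observed_habitat, reference=HABITATS):
    """Flag only when the species is listed and the observed habitat is not allowed for it."""
    allowed = reference.get(species)
    if allowed is None:
        return False  # species not in the list: no basis for a flag
    return observed_habitat not in allowed


def most_flagged(flagged_species, n=10):
    """Count how often each species name appears among flagged records."""
    return Counter(flagged_species).most_common(n)
```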

Tasilee commented 4 years ago

@elywallis : See https://github.com/tdwg/bdq/issues/51#issuecomment-625572639 etc

elywallis commented 4 years ago

@Tasilee yep, what they all said ;-) especially Dave Watts

M-Nicholls commented 4 years ago

possibly implement as a stand-alone service

M-Nicholls commented 3 years ago

Date assertions need improvement, although this is mostly related to the data parsing rather than the tests, e.g.:
• different date formats or date ranges are not parsing correctly
• we need a "date not supplied" assertion
• the invalid and incomplete date assertions don't work as expected
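One way to keep the three outcomes apart ("not supplied", "incomplete", "invalid") is sketched below. The assertion names and the two accepted formats are assumptions for illustration, not the ones used by the ALA parser, and date ranges are not handled.

```python
import re
from datetime import datetime

# Assumed input formats, for illustration only.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def assess_date(raw):
    """Classify a raw date string using hypothetical assertion names."""
    if raw is None or not raw.strip():
        return "DATE_NOT_SUPPLIED"
    text = raw.strip()
    # Year only, or year and month: parseable but incomplete.
    partial = re.fullmatch(r"(\d{4})(?:-(\d{2}))?", text)
    if partial:
        month = partial.group(2)
        if month is not None and not 1 <= int(month) <= 12:
            return "INVALID_DATE"
        return "INCOMPLETE_DATE"
    for fmt in FORMATS:
        try:
            datetime.strptime(text, fmt)
            return "VALID"
        except ValueError:
            continue
    return "INVALID_DATE"
```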

M-Nicholls commented 3 years ago

Feedback from MEL herbarium

Coordinates outside country
• This report appears to only use mainland Australian boundaries and ignores dependencies so, for example, all records from Macquarie Island are erroneously reported as having coordinates outside the country.
• Canary Islands specimens are listed as being outside the country, probably because they're being delivered with Spain as the political country, but obviously they don't map within the borders of Spain proper.

Coordinates outside state
• Most of these were close to the border, and occurred where the geocode only had minutes, but not seconds. The combination of the geocode and the uncertainty radius would usually place them in the correct state, even if the dot itself was over the border, so it would be good to include the uncertainty radius in this calculation. Also, we wondered if the datum was factored in and the geocode translated accordingly?

Coordinates in centre of state
• All the ones checked to date were okay; the localities just happened to be places that were in the centre of the state. It would be good if this report could somehow test against the text locality description as well as the state, to avoid flagging too many records as potentially suspect.
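The "coordinates outside state" suggestion, including the uncertainty radius in the calculation, could look roughly like this. It is a deliberately crude sketch: the state is reduced to a bounding box (south, west, north, east), whereas a real check would use the state polygon, and the metres-per-degree constant is an approximation that breaks down near the poles.

```python
import math

METRES_PER_DEGREE_LAT = 111_320.0  # approximate length of one degree of latitude


def within_box_with_uncertainty(lat, lon, uncertainty_m, box):
    """True if the point, widened by its uncertainty radius, can fall inside the box."""
    south, west, north, east = box
    dlat = uncertainty_m / METRES_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = uncertainty_m / (METRES_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (south - dlat <= lat <= north + dlat) and (west - dlon <= lon <= east + dlon)
```

With a hypothetical box of `(-39.0, 141.0, -34.0, 150.0)`, a point at longitude 140.99 is outside with zero uncertainty but inside once a 2000 m radius is allowed for.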

RobinaSanderson commented 3 years ago

Hi @M-Nicholls - For the first one in "Coordinates outside country", we've been discussing this a bit in the pipelines project. It appears that the layers we use to determine this sometimes treat non-contiguous territories as other countries. For example, Christmas Island has an ISO country code and so is in our list of recognised countries (although it is a territory of Australia). Currently, we check the coordinates in the data given to us, check the layer to see what country the coordinates are in, and determine the country from that. So where we have a Christmas Island record with the country name supplied as Australia, we give it a processed country name of Christmas Island.

We have similar problems with the Faroe Islands and Gibraltar. Not sure if this is the same problem for Macquarie Island.

Treating Christmas Island as its own country does allow the centre of country check to run against the centre of Christmas Island rather than the centre of mainland Australia.

The pipelines project won't be fixing this.

GBIF are looking to implement the GADM country and region layers (primarily so that some of their data users can easily get species distributions for those regions from GBIF). Not sure if using the GADM layers would fix this issue.
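For illustration only, and not how the pipelines project does it: one way to avoid a false "outside country" flag would be to normalise both the supplied country and the country found from the layer to a sovereign state before comparing. The mapping below uses ISO 3166-1 alpha-2 codes and covers just the examples mentioned here; the function and table names are hypothetical.

```python
# Territory code -> sovereign state code (ISO 3166-1 alpha-2).
SOVEREIGN = {
    "CX": "AU",  # Christmas Island -> Australia
    "FO": "DK",  # Faroe Islands -> Denmark
    "GI": "GB",  # Gibraltar -> United Kingdom
}


def country_mismatch(supplied_iso, layer_iso, sovereign=SOVEREIGN):
    """Compare countries after mapping territories to their sovereign state."""
    def normalise(code):
        return sovereign.get(code, code)
    return normalise(supplied_iso) != normalise(layer_iso)
```

The processed country could still be kept as Christmas Island so that the centre of country check runs against the territory's own centre.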

Tasilee commented 3 years ago

See https://github.com/tdwg/bdq/issues/56

M-Nicholls commented 3 years ago

coordinate precision mismatch:

That test checks whether the number of decimal places provided in the coordinates matches the provided coordinate precision. When the test was written, we were expecting something like 0.01, 0.001, 0.0001 etc. for the coordinate precision, so we could basically check the number of decimal places in the coordinates. However, that doesn't make sense for data provided as degrees, minutes, seconds where (as in this case) the correct precision for the nearest minute is 0.01667. Basically, the test is oversimplified and isn't working correctly for precision provided as anything other than the simple 0.1, 0.01 etc.
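The behaviour described above can be approximated as follows. This is a reconstruction from the description, not the biocache-store code, and the function names are hypothetical. Values are passed as strings so that trailing digits are preserved.

```python
from decimal import Decimal


def decimal_places(value: str) -> int:
    """Number of digits after the decimal point in a numeric string."""
    return max(0, -Decimal(value).as_tuple().exponent)


def naive_precision_mismatch(lat: str, lon: str, precision: str) -> bool:
    """Flag when the coordinates' decimal places differ from those of the precision."""
    expected = decimal_places(precision)
    return decimal_places(lat) != expected or decimal_places(lon) != expected


# Precision of the nearest minute, in decimal degrees: 1/60 rounded to 5 places.
MINUTE_PRECISION = round(1 / 60, 5)
```

With a precision of 0.01 the comparison behaves as intended. With 0.01667 the outcome depends only on how many digits the provider happened to keep: "-33.51667" passes while "-33.5167" is flagged, although both describe the same nearest-minute position.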

M-Nicholls commented 3 years ago

state_coordinate_mismatch not working:

https://github.com/AtlasOfLivingAustralia/biocache-store/blob/2bed497d30c2ea2d5240724736df0f35bd7cd36e/src/main/scala/au/org/ala/biocache/processor/LocationProcessor.scala#L721

E.g. https://biocache.ala.org.au/ws/occurrences/f86c36b7-61be-47f4-95d6-f7e4d460e8f5.json should be flagged, as the record is in the ocean. It's not flagged because there is no processed value for state, even though a raw state value is provided.
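A possible fallback, sketched under assumptions: when there is no processed state, compare the raw state against whatever state the coordinates resolve to. The function and parameter names are hypothetical and this is not the logic in LocationProcessor.scala.

```python
def state_coordinate_mismatch(raw_state, processed_state, state_from_coordinates):
    """Flag when the supplied state disagrees with the state found from the coordinates."""
    supplied = processed_state or raw_state  # fall back to the raw value
    if not supplied:
        return False  # no state supplied at all: nothing to compare
    if state_from_coordinates is None:
        return True  # coordinates resolve to no state, e.g. a point in the ocean
    return supplied.strip().lower() != state_from_coordinates.strip().lower()
```

Raw values such as abbreviations would still need mapping to full state names before the comparison is reliable.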