Open M-Nicholls opened 4 years ago
Review existing tests and assertions. For example, the invalid institutionCode and invalid collectionCode assertions are not appropriate for NSW bioNet data. The check is probably doing a lookup on the collections and institutions in the collectory.
e.g. https://biocache.ala.org.au/occurrences/38b7d97c-4d75-4de4-9064-f90edbed9b32 https://biocache.ala.org.au/occurrences/763dad83-3eb6-464d-9068-c0907e4927e7
institutionCode = The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. Current e.g. ‘NSW Office of Environment and Heritage’
ownerInstitutionCode = The name (or acronym) in use by the institution having ownership of the object(s) or information referred to in the record. Current e.g. ‘Office of Environment and Heritage’, ‘Birds Australia’, ‘Australian Museum’
It's also a double negative on the test result - fail for "code not recognised"
related infrastructure issue: https://github.com/AtlasOfLivingAustralia/la-pipelines/issues/13
This GeospatialKosher issue has been resolved in this issue: https://github.com/AtlasOfLivingAustralia/biocache-store/issues/375 fix GeospatialKosher:
APPENDIX 2: Definition of “Spatially valid”
We don't have an easily accessible definition of "Spatially valid", so I figured this needed to be addressed and amended. The definitions of the tests that the ALA runs can be found at https://biocache-ws.ala.org.au/ws/assertions/codes. The current formal definition of "Spatially valid" is that a record has no assertion with isFatal: true and code < 10000, that is the following tests/flags
I would not have included 2, 6, 8, 9 or 10 as FATAL.
See also https://github.com/AtlasOfLivingAustralia/biocache-store/issues/375 for a parallel discussion.
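The definition above can be sketched as a filter over the assertion codes. This is a minimal sketch, not the biocache implementation: only the `isFatal` and `code` fields come from the discussion, while the sample entries, their names, and both function names are hypothetical.

```python
# Hypothetical sample shaped like entries from the assertion codes listing.
# Only "code" and "isFatal" are taken from the discussion; names are made up.
SAMPLE_CODES = [
    {"name": "EXAMPLE_FATAL_SPATIAL", "code": 1, "isFatal": True},
    {"name": "EXAMPLE_WARNING_SPATIAL", "code": 2, "isFatal": False},
    {"name": "EXAMPLE_FATAL_NON_SPATIAL", "code": 10001, "isFatal": True},
]


def affects_spatial_validity(entry):
    """True when an assertion is fatal and its code is below 10000."""
    return bool(entry.get("isFatal")) and entry["code"] < 10000


def is_spatially_valid(record_assertion_codes, code_table):
    """A record is spatially valid when it carries none of the fatal codes."""
    fatal_codes = {e["code"] for e in code_table if affects_spatial_validity(e)}
    return not (set(record_assertion_codes) & fatal_codes)
```

Under this reading, changing which tests count towards "Spatially valid" is only a matter of changing which entries are marked `isFatal`.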
• Sensitive and generalised records
o Is this interacting with the filters (habitat and coordinate uncertainty)?
o Is the coordinate uncertainty taken into account in the habitat mismatch test?
o Is generalisation of records moving their habitat and so they fail the habitat mismatch test?
o Data points on the coastline can get flagged as wrong habitat,
Is coordinates centre of country working as an exclusion filter? See record where there appears to be a NT observation of a SA/Vic spp: https://biocache-dq-test.ala.org.au/occurrences/search?taxa=Caladenia+concolor#tab_mapView - should have failed the tests applied by the filters.
is not working (possibly too precise)
For https://github.com/AtlasOfLivingAustralia/DataQuality/issues/40#issuecomment-630473632, TG2 decided not to incorporate dwc:coordinateUncertaintyInMeters and dwc:coordinatePrecision, but we have agreed that spatial buffers are required.
Is the coordinate uncertainty taken into account in the habitat mismatch test?
Suggest no but buffer yes.
Is generalisation of records moving their habitat and so they fail the habitat mismatch test?
Probably, but it is a warning, so some false positives must be anticipated. This needs to be documented so that, if the test is applied, users can determine its applicability to their work.
o Data points on the coastline can get flagged as wrong habitat,
A buffer will reduce the false positives.
This applies to centre of country/state/territory: use buffer.
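A buffered centre check could look something like the sketch below. It assumes the existing check needs a near-exact coordinate match, which the thread only speculates about ("possibly too precise"). The function names, the buffer size, and the centre coordinates used in the usage note are all illustrative and not taken from the ALA code.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_centre(lat, lon, centre_lat, centre_lon, buffer_km):
    """Flag a record when it falls within buffer_km of the reference centre."""
    return haversine_km(lat, lon, centre_lat, centre_lon) <= buffer_km
```

For example, with an assumed centre of (-25.0, 134.0), a point at (-25.001, 134.001) is missed with a zero buffer but flagged with a 1 km buffer.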
My two cents is that having a buffer would certainly improve things but having a reference list that allows a species to be both marine and terrestrial might help even more - plenty of examples of species that can be found in both environments (penguins, seals, crocodiles, most shore birds etc)
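A reference list that allows more than one habitat per species could be sketched as below. The table, the habitat labels, and the function name are hypothetical and only illustrate the idea; they are not an existing ALA list.

```python
# Hypothetical reference list: each species maps to a set of acceptable habitats.
SPECIES_HABITATS = {
    "Eudyptula minor": {"marine", "terrestrial"},     # little penguin
    "Crocodylus porosus": {"marine", "terrestrial"},  # saltwater crocodile
    "Caladenia concolor": {"terrestrial"},            # an orchid
}


def habitat_mismatch(species, observed_habitat, reference=SPECIES_HABITATS):
    """Flag a record only when the observed habitat is not in the allowed set."""
    allowed = reference.get(species)
    if allowed is None:
        return False  # assumption: species without a reference entry are not flagged
    return observed_habitat not in allowed
```

With a set per species, a penguin recorded on land or at sea passes, while an orchid recorded at sea is still flagged.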
Do we have (or could we do) any analysis of which species are most commonly flagged as 'out of their habitat' or as habitat mismatch so we can see what the spread of species actually is? If someone can generate the list I'd be happy to do the analysis
@elywallis : See https://github.com/tdwg/bdq/issues/51#issuecomment-625572639 etc
@Tasilee yep, what they all said ;-) especially Dave Watts
possibly implement as a stand-alone service
Date assertions need improvement, although this is mostly related to the data parsing rather than the tests, e.g.
• different date formats or date ranges not parsing correctly
• need a "date not supplied" assertion
• the invalid and incomplete date assertions don't work as expected
Feedback from MEL herbarium
Coordinates outside country
• This report appears to only use mainland Australian boundaries and ignores dependencies so, for example, all records from Macquarie Island are erroneously reported as having coordinates outside the country.
• Canary Islands specimens are listed as being outside the country, probably because they're being delivered with Spain as the political country, but obviously they don't map within the borders of Spain proper.
Coordinates outside state
• Most of these were close to the border, and occurred where the geocode only had minutes, but not seconds. The combination of the geocode and the uncertainty radius would usually place them in the correct state, even if the dot itself was over the border, so it would be good to include the uncertainty radius in this calculation.
• Also, we wondered if the datum was factored in and the geocode translated accordingly?
Coordinates in centre of state
• All the ones checked to date were okay; the localities just happened to be places that were in the centre of the state. It would be good if this report could somehow test against the text locality description as well as the state to avoid flagging too many records as potentially suspect.
Hi @M-Nicholls - For the first one in "Coordinates outside country", we've been discussing this a bit in the pipelines project. It appears that the layers we use to determine this sometimes treat non-contiguous territories as other countries. For example, Christmas Island has an ISO country code and so is in our list of recognised countries (although it is a territory of Australia). Currently, we take the coordinates in the data given to us, check the layer to see what country the coordinates are in, and determine the country from that. So where we have a Christmas Island record with the country name supplied as Australia, we give it a processed country name of Christmas Island.
We have similar problems with the Faroe Islands and Gibraltar. Not sure if this is the same problem for Macquarie Island.
By treating Christmas Island as its own country, it does allow the centre of country check to run against the centre of Christmas Island rather than the centre of mainland Australia.
The pipelines project won't be fixing this.
GBIF are looking to implement the GADM country and region layers (primarily so that some of their data users can easily get species distributions for those regions from GBIF). Not sure if using the GADM layers would fix this issue.
coordinate precision mismatch:
That test checks whether the number of decimal places provided in the coordinates matches the provided coordinate precision. When the test was written, we were expecting something like 0.01, 0.001, 0.0001 etc. for the coordinate precision, so we could basically check the number of decimal places in the coordinates. However, that doesn't make sense for data provided as degrees, minutes, seconds, where (as in this case) the correct precision for the nearest minute is 0.01667. Basically, the test is oversimplified and isn't working correctly for precision provided as anything other than the simple 0.1, 0.01 etc.
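The behaviour described above can be reproduced with a small sketch. This is one reading of the test as described in the comment, not the actual biocache code, and both function names are hypothetical.

```python
from decimal import Decimal


def decimal_places(value: str) -> int:
    """Count the digits after the decimal point in a numeric string."""
    exponent = Decimal(value).as_tuple().exponent
    return max(0, -exponent)


def naive_precision_mismatch(coordinate: str, coordinate_precision: str) -> bool:
    """Flag a mismatch when the two values have different numbers of decimal places."""
    return decimal_places(coordinate) != decimal_places(coordinate_precision)
```

A latitude of "-33.85" with a precision of "0.01" passes. The same latitude with the nearest-minute precision "0.01667" is flagged, because the precision has five decimal places and the coordinate has two.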
state_coordinate_mismatch not working:
E.g. https://biocache.ala.org.au/ws/occurrences/f86c36b7-61be-47f4-95d6-f7e4d460e8f5.json should be flagged, as the record is in the ocean. It's not flagged because there is no processed value for state, even though a raw state value is provided.
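One possible way to avoid the silent skip is to fall back to the raw state when no processed value exists. The sketch below is an assumption about how such a fallback might look, not a description of the current test. The function and parameter names are hypothetical, and the string comparison is deliberately simplistic (it would not match "NSW" against "New South Wales").

```python
def state_mismatch(raw_state, processed_state, state_at_coordinates):
    """Compare the supplied state with the state found at the coordinates.

    state_at_coordinates is None when the point falls in no state (e.g. ocean).
    """
    supplied = processed_state or raw_state  # fall back to the raw value
    if not supplied:
        return False  # no state supplied at all, so there is nothing to compare
    if state_at_coordinates is None:
        return True  # a state was supplied but the coordinates are in no state
    return supplied.strip().lower() != state_at_coordinates.strip().lower()
```

With this fallback, a record with a raw state, no processed state, and coordinates in the ocean would be flagged.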
Tests and assertions that provide information as to the "quality" of the data
implement tests and assertions developed from TDWG/GBIF working group