Communicating data set quality

cookeac commented 2 years ago

When interchanging a map layer, it can be helpful for the recipient to understand the:

Provenance (generated by whom, and when)
Scale
Processing applied
Applicability (intended purposes, or what it might be suitable for).

It would make sense to attach this information to a feature catalog item (either a OCG feature catalog entry, or more specifically to a holding-level FeatureCatalogItemResource in this schema).
This information could also be attached to a feature collection or even individual features when these are served, but it makes more sense to use a feature catalog because that way decisions about applicability can be made before the feature collection is fetched.
Where possible, use should be made of the Geographic Information Data Quality Metadata Standard in ISO 19157 (licenced document, hence not linked). There is however a precis of this standard available and a more general document here.

Key data quality attributes are:

Completeness (metrics of omission and commission errors from a data evaluation)
Logical Consistency (measures of format, topology, domain, and conceptual consistency)
Thematic Accuracy (metrics of attribute correctness, particularly in classification of attributes)
Temporal Quality (accuracy of time measurement and temporal consistency and validity)
Positional Accuracy (resolution, accuracy of represented positions vs real world)
Meta Quality (confidence, homogeneity, and representativity)
Conformance (conformance with a standard or agreed profile)

Many of these require a formal QA review (using ISO 19158) to populate, which I suspect won't happen with most farm-level data, except perhaps for data sets compiled formally at national or regional level.

It seems that the most relevant to farm scale data might be:

Data Quality - Conformance - to an agreed specification, naming the specification, and with a boolean True/False for conformance (would require agreed specifications)
Spatial Resolution - Vector Spatial Representation - topology level code, scale denominator or distance equivalent
Provenance - Derivation - derivation method
Provenance - Acquisition - acquisition platform or instrument (though these are really designed for satellite data etc)

Feedback needed: Could those who are interested in data quality metrics please comment with their needs/thoughts?

LynkerAnalyticsGordonMorris commented 2 years ago

When talking about data quality, I use a helpful metadata element in the ANZLIC Metadata - “lineage”. This allows a description of how the source data was found, created, and developed. While its an optional obligation, I use it in the production of my imagery and encourage it for use in the Artificial Intelligence / Machine Learning environment.

As noted in the ANZLIC Metadata user guide: “This field should be used to indicate whether the data are observations, analyses (re-analyses), forecast (based on initial states including observations), simulations or other sources of data. It could also be used to include the platform/mission in the source of data (e.g. Ship, aircraft, satellite, satellite id). There may be a need to use pairs [source, processing step] to provide additional information. May contain references (e.g. URL) to external information on the processing and source”

I like how it's easy to read, and folk can put as much or as little jargon as they like, as long as it reads well. This would imply it could be machine-read in the future.

I can see "lineage" (or indeed the "abstract") being used at a plot level, as an example "soil condition inferred from fertilizer placement data supplied by SpreadTheGood Ltd on 4 May 2022 under ideal placement conditions with a 1 second time interval (approximately 5m) then rasterized to a 5 metre grid to match existing data holdings").

For example in some of our current work, we have the following lineage statement: "Lineage. Lynker Analytics support to Toitū Te Whenua Land Information New Zealand includes the provision of datasets derived from Artificial Intelligence. This includes layers derived from Machine Learning (ML), which can be described as mathematical models that generate predictions of features from imagery. The pixel segmentation maps generated from the ML models are vectorised autonomously using a rules-based process. The polygons are validated using a human captured test data set with semi-automated quality control while human review and editing are used for a small percentage that fail automated acceptance criteria."

We then have a statement on quality assurance for the ANZLIC Metadata profile: Quality Assurance. Lynker Analytics use a rigorous quality assurance process, including imagery and geometry review. Polygons in several training areas were captured to the LINZ specification by a human operator. This “truth” data set was used to finetune pre-trained models held by Lynker Analytics. A proportion of the objects are withheld from training to be used as quality assurance for the derived building data layer. Separate neural network models are then used in a QA network. Image classification neural networks automate the assessment of captured building polygons against the source imagery. Captured polygons that fall outside accuracy thresholds relative to prior building capture or that are identified by our QA network will trigger final human review. Building placement is within a 4 pixel tolerance of the absolute position of the building with respect to the imagery provided.

Users may refer to the [ Data Dictionary where users can link to] for detailed information about the criteria used to define this dataset.

In the above instance we've not gone into great detail about the type of tools and Lynker Analytics procedures use, but it gives enough for an average end user to assess that this data itself has been through a quality assurance process.

cookeac commented 2 years ago

Brilliant feedback @LynkerAnalyticsGordonMorris, add I love the examples.

So key pieces to think about at inter-company farm scale integration are:

Should Lineage and other methods be used at the dataset or the element MetaData level?
How do we convert lineage or QA statements to machine readable elements that can be assessed in (near) real time when integrating software systems.

cookeac commented 2 years ago

Thanks everyone for a good discussion of this at our meeting on 4th May 2022.

One of the outcomes was that the Rezare Systems team would do some research into existing metadata specifications that met your needs. The obvious starting point was ISO 19115-(1,2,3). We also looked at the AS/NZS spatial metadata standard, which is a profile of ISO 19115-2. Finally, we looked at the draft OGC API - Records which is a specification for a JSON API for catalog metadata services.

OGC API - Records was the most attractive because of its fit with this project, but it is the least complete. It currently only includes a subset of the most core data (title, links, extent, crs) and a few metadata fields.

In order to meet your stated needs we would likely need:

Resource Information such as abstract, spatial and temporal extents
Resource Lineage information
Spatial Resolution information

The Alaska Data Integration Working Group has done some excellent work in creating a JSON Schema version of the ISO19115-x specifications, and this looks like a useful route for us. However, ISO 19115 is a large standard, so while we want to have the broad specification available to support maximum use, it may still be appropriate to agree a subset of required fields (by documentation) that will be used to support automated interchange of data between agricultural applications.

Datalinker-Org / Geospatial

Communicating data set quality #27