
Quality Control Implementation #36

Open Patowhiz opened 7 months ago

Patowhiz commented 7 months ago

Overview

After reviewing the WMO CDMS specifications, I suggest developing the following quality control (QC) submodules to enhance our climate data management system:

  1. Duplicate Data Check: To eliminate duplicate entries during data ingestion, preventing unnecessary redundancy.

  2. Limits Check: To flag values that fall outside the acceptable range during data ingestion so they can be reviewed.

  3. Source Check: To differentiate and validate identical data from various sources, designating the most reliable source as final.

  4. Missing Data Check: To detect data gaps, facilitating informed decisions on handling these absences for subsequent analysis.

  5. Internal Consistency Check: To verify the coherence of related data points within the dataset, such as temperature and dew point correlations. This check will include same-value, jump-value, and inter-element checks.

  6. Temporal Consistency Check: To identify abrupt temporal changes, distinguishing between potential errors and actual environmental shifts.

  7. Spatial Consistency Check: To assess data across various locations, identifying spatial anomalies that may indicate localized discrepancies.

  8. Extreme Value Check: To scrutinize and authenticate any extreme values or statistical outliers beyond the normal range.

  9. Data Homogeneity Check: To correct biases from changes in observational methods or locations, especially vital for long-term climate studies.

  10. Metadata Check: To investigate metadata for additional insights that may elucidate detected anomalies or inconsistencies.

I recommend constructing a QC workflow that processes these checks in a logical and efficient sequence, starting with simpler tasks and advancing to more complex analyses. While some steps may occur concurrently, the overall process should be iterative, ensuring a comprehensive and nuanced data quality assessment.
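
To make the workflow idea more concrete, below is a minimal TypeScript sketch of how the checks could be chained into a sequential pipeline. The names (`QcCheckType`, `QcCheck`, `runQcPipeline`) and the shape of the `Observation` record are illustrative assumptions, not existing Climsoft Web code.

```ts
// Hypothetical sketch of a sequential QC pipeline; names and shapes are assumptions.

// The QC submodules listed above, expressed as a check type.
enum QcCheckType {
  Duplicate = 'duplicate',
  Limits = 'limits',
  Source = 'source',
  MissingData = 'missing_data',
  InternalConsistency = 'internal_consistency',
  TemporalConsistency = 'temporal_consistency',
  SpatialConsistency = 'spatial_consistency',
  ExtremeValue = 'extreme_value',
  Homogeneity = 'homogeneity',
  Metadata = 'metadata',
}

interface Observation {
  stationId: string;
  elementId: number;
  datetime: string;       // ISO 8601
  value: number | null;   // null when the observation is missing
}

interface QcFlag {
  checkType: QcCheckType;
  observation: Observation;
  message: string;
}

interface QcCheck {
  type: QcCheckType;
  run(observations: Observation[]): QcFlag[];
}

// Runs the checks in the configured order (simpler checks first) and collects all flags.
function runQcPipeline(checks: QcCheck[], observations: Observation[]): QcFlag[] {
  const flags: QcFlag[] = [];
  for (const check of checks) {
    flags.push(...check.run(observations));
  }
  return flags;
}
```

An iterative assessment could then be modelled by re-running the pipeline after flagged values have been reviewed or corrected.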

Furthermore, each QC step will be systematically logged in the observation model, which is specifically designed to accommodate these checks, enhancing transparency and traceability in data quality control.

Some of these checks could be user-driven (manual), system-driven (automated), or semi-automated, depending on the nature of the quality control check.
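
To illustrate how the logging and the manual/automated distinction could be captured in the observation model, here is a minimal sketch; field names such as `qcStatus` and `qcTestLog` are assumptions for discussion, not the actual schema.

```ts
// Illustrative QC metadata attached to an observation record; field names are assumptions.

enum QcExecutionMode {
  Manual = 'manual',              // user driven
  Automatic = 'automatic',        // system driven
  SemiAutomatic = 'semi_automatic',
}

enum QcStatus {
  NotChecked = 'not_checked',
  Passed = 'passed',
  Flagged = 'flagged',
  Corrected = 'corrected',
}

interface QcLogEntry {
  checkType: string;              // e.g. 'limits', 'spatial_consistency'
  executionMode: QcExecutionMode;
  passed: boolean;
  comment?: string;               // reviewer or system remarks
  performedAt: string;            // ISO 8601 timestamp
  performedBy?: string;           // user id when the check is manual
}

interface ObservationQcMetadata {
  qcStatus: QcStatus;
  qcTestLog: QcLogEntry[];        // one entry per QC step, for traceability
}
```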

Request for Comments

I invite feedback on this proposal. Your insights and suggestions will be invaluable.

Patowhiz commented 3 months ago

Quality Control Tests

After further evaluation of possible QC tests, I now suggest we break down the above generalised quality controls into distinct components that cover WMO Publication No. 1131, section 5.3, on Observations Quality Control. Each test serves a specific purpose in verifying that the data behaves as expected under various conditions. A minimal code sketch of two of these tests follows the list.

  1. Range Threshold Test:

    • Description: This test verifies that the value of a single climate element falls within a defined range. It checks both the lower and upper thresholds to ensure the data is within realistic and acceptable limits.
    • Example: Ensuring that air temperature stays between -50°C and 50°C, or that humidity levels are between 0% and 100%.
  2. Repeated Value Test:

    • Description: This test specifically looks for exact repetitions of the same value in sequence. It's designed to catch situations where the data appears stuck on a single value without any variation. It’s useful for identifying stuck sensors or errors in data collection.
    • Example: Identifying cases where wind speed records the exact same value consecutively over a day, which may indicate a sensor problem.
  3. Flat Line Test:

    • Description: This test detects subtler issues where the sensor produces data with very minimal variation, indicating a potential problem with its ability to measure changes accurately. It is more flexible than a strict repeated value test: it catches patterns where the data varies slightly but remains within a very narrow range that is effectively flat.
    • Example: Identifying cases where the sensor might be malfunctioning in a way that still produces slight variations but not enough to reflect actual environmental changes.
  4. Spike Test:

    • Description: This test checks for sudden, significant changes (spikes) in a climate element’s value between consecutive observations. It helps catch abrupt anomalies that may not be physically realistic.
    • Example: Detecting an unexpected spike in temperature, such as a sudden increase of 10°C within a five-minute interval.
  5. Relational Comparison Test:

    • Description: This test compares the value of one climate element against another to ensure that the expected relationship is maintained. It checks whether one value is greater than, less than, or equal to another.
    • Example: Ensuring that air temperature is always greater than or equal to the dew point temperature.
  6. Diurnal Test:

    • Description: This test ensures that the observed data reflects expected diurnal patterns, typically seen in temperature or solar radiation. It checks whether the daily cycle of rising and falling values is consistent with natural conditions.
    • Example: Verifying that temperatures rise during the day and fall at night, following typical diurnal variation.
  7. Contextual Consistency Test:

    • Description: This test checks that the value of one climate element is contextually consistent with another. It verifies that certain conditions logically influence other related elements.
    • Example: When cloud cover is heavy (e.g., a cloud value of 8), ensuring that sunshine values are low or zero, reflecting realistic atmospheric conditions.
  8. Remote Sensing Consistency Test:

    • Description: This test compares ground station data with values obtained from remote sensing technologies like radar or satellite. It ensures consistency between in-situ measurements and broader environmental observations captured through remote sensing.
    • Example: Checking that rainfall measurements from a ground station are consistent with radar-based precipitation estimates or that ground temperature measurements align with satellite-derived temperature data.
  9. Spatial Consistency Test:

    • Description: This test compares data from multiple nearby stations to ensure consistency. It checks whether values from different locations in close proximity are reasonably similar, considering the environmental context.
    • Example: Verifying that temperature readings from stations within the same region are similar, adjusting for known microclimate differences.
  10. Source Check:

    • Description: This test differentiates and validates identical data from various sources, designating the most reliable source as final.
    • Example: Comparing values captured through double data entry, or evaluating the performance of different instruments measuring the same element.
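
As mentioned above, here is a minimal sketch of how two of these tests (the Range Threshold Test and the Spike Test) might be implemented as pure functions. The function names, the `TimedValue` shape and the example thresholds are assumptions for illustration only.

```ts
// Illustrative implementations of two of the tests above; names and shapes are assumptions.

interface TimedValue {
  datetime: string;   // ISO 8601
  value: number;
}

// Range Threshold Test: flag values outside the [lower, upper] limits.
function rangeThresholdTest(values: TimedValue[], lower: number, upper: number): TimedValue[] {
  return values.filter(v => v.value < lower || v.value > upper);
}

// Spike Test: flag values whose change from the previous observation exceeds maxDelta.
function spikeTest(values: TimedValue[], maxDelta: number): TimedValue[] {
  const flagged: TimedValue[] = [];
  for (let i = 1; i < values.length; i++) {
    if (Math.abs(values[i].value - values[i - 1].value) > maxDelta) {
      flagged.push(values[i]);
    }
  }
  return flagged;
}

// Example usage with the air temperature limits mentioned above (-50 °C to 50 °C)
// and a maximum change of 10 °C between consecutive five-minute observations.
const temperatures: TimedValue[] = [
  { datetime: '2024-06-01T10:00:00Z', value: 21.4 },
  { datetime: '2024-06-01T10:05:00Z', value: 33.0 },  // suspicious jump
  { datetime: '2024-06-01T10:10:00Z', value: 65.0 },  // outside the realistic range
];
console.log(rangeThresholdTest(temperatures, -50, 50));  // flags the 65.0 reading
console.log(spikeTest(temperatures, 10));                // flags both jumps
```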

Setting Parameters for Quality Control Tests

The parameters for these quality control checks should be carefully defined after heuristic evaluations have been performed by experienced quality control operators and climatologists. These experts should analyze historical data and current observations to determine the normal ranges and expected behavior of each climate element. This ensures that the thresholds and conditions set for each test are realistic and tailored to the specific environmental context. By establishing these parameters based on expert insights, we can enhance the accuracy and reliability of the data validation process, leading to more trustworthy climate data for analysis and decision-making.
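
For illustration, the expert-defined parameters could be captured in a simple per-element configuration (with optional station or seasonal overrides); the shape and values below are assumptions, not an actual Climsoft Web schema.

```ts
// Hypothetical QC parameter configuration; the schema and values are assumptions.
interface QcTestParameters {
  elementId: number;                // climate element, e.g. air temperature
  stationId?: string;               // optional station-specific override
  season?: 'djf' | 'mam' | 'jja' | 'son';  // optional seasonal override
  lowerLimit?: number;              // Range Threshold Test
  upperLimit?: number;
  maxConsecutiveRepeats?: number;   // Repeated Value Test
  flatLineTolerance?: number;       // Flat Line Test: maximum variation treated as flat
  maxSpike?: number;                // Spike Test: maximum change between observations
}

// Example derived from a heuristic evaluation of historical data (values purely illustrative).
const airTemperatureParams: QcTestParameters = {
  elementId: 101,
  lowerLimit: -50,
  upperLimit: 50,
  maxConsecutiveRepeats: 6,
  flatLineTolerance: 0.1,
  maxSpike: 10,
};
```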

Quality Control Test Groupings per the CDMS Specifications (WMO Publication No. 1131) and How They Relate to the Above Tests

  1. 5.3.1.1 Consistency checks - Relational Comparison Test, Diurnal Test and Contextual Consistency Test.

  2. 5.3.1.2 Data comparison - Remote Sensing Consistency Test.

  3. 5.3.1.3 Heuristic checks - These should be done first to define parameters for the tests. Continuous evaluation should also be carried out to monitor the need for parameter changes caused by local environmental changes or climate change.

  4. 5.3.1.4 Statistical checks - Flat Line Test, Spike Test and Repeated Value Test.

  5. 5.3.1.5 Spatial checks - Spatial Consistency Test.

  6. 5.3.1.6 Data recovery - All the tests will have a user interface that allows for data corrections.
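
If useful, this grouping could also be encoded in configuration so that QC results and reports can reference the relevant WMO Publication No. 1131 subsection; the structure below is purely illustrative.

```ts
// Illustrative mapping of WMO Publication No. 1131, section 5.3.1 groupings to the tests above.
const qcGroupings: Record<string, string[]> = {
  '5.3.1.1 Consistency checks': ['Relational Comparison Test', 'Diurnal Test', 'Contextual Consistency Test'],
  '5.3.1.2 Data comparison': ['Remote Sensing Consistency Test'],
  '5.3.1.3 Heuristic checks': [],  // done first, to define the parameters for the other tests
  '5.3.1.4 Statistical checks': ['Flat Line Test', 'Spike Test', 'Repeated Value Test'],
  '5.3.1.5 Spatial checks': ['Spatial Consistency Test'],
  '5.3.1.6 Data recovery': [],     // correction workflows exposed through the user interface
};
```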

Patowhiz commented 2 months ago

We could add the following tests as products in the products module.

Difference Threshold Test: