Patowhiz opened this issue 7 months ago
After further evaluation of possible QC tests, I now suggest we break down the generalised quality controls above into distinct components that cover WMO publication No. 1131, section 5.3, on observations quality control. Each test serves a specific purpose in verifying that the data behaves as expected under various conditions.
Range Threshold Test:
Repeated Value Test:
Flat Line Test:
Spike Test:
Relational Comparison Test:
Diurnal Test:
Contextual Consistency Test:
Remote Sensing Consistency Test:
Spatial Consistency Test:
Source Check: To differentiate and validate identical data from various sources, designating the most reliable source as final. This is useful for double data entry or when evaluating the performance of different instruments.
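To make the Source Check concrete, here is a minimal sketch in Python. The record fields and the reliability ranking are assumptions for illustration, not the final design: when several sources report the same observation, the value from the highest-priority source is kept as final and the remaining duplicates are flagged for review.

```python
# Hypothetical sketch of a Source Check: when multiple sources report the
# same element for the same station and time, keep the value from the most
# reliable source and flag the rest for review.

# The reliability ranking is an assumption; in practice it would be
# configured by QC operators (e.g. keyed entry vs. automatic weather station).
SOURCE_PRIORITY = {"manual_double_entry": 1, "aws": 2, "paper_archive": 3}

def resolve_duplicates(observations):
    """observations: list of dicts with keys
    station_id, element_id, datetime, source, value."""
    grouped = {}
    for obs in observations:
        key = (obs["station_id"], obs["element_id"], obs["datetime"])
        grouped.setdefault(key, []).append(obs)

    final, flagged = [], []
    for group in grouped.values():
        group.sort(key=lambda o: SOURCE_PRIORITY.get(o["source"], 99))
        final.append(group[0])      # most reliable source becomes final
        flagged.extend(group[1:])   # remaining duplicates kept for review
    return final, flagged
```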
The parameters for these quality control checks should be defined only after heuristic evaluations by experienced quality control operators and climatologists. These experts should analyze historical data and current observations to determine the normal ranges and expected behavior of each climate element, so that the thresholds and conditions set for each test are realistic and tailored to the specific environmental context. Basing the parameters on expert insight improves the accuracy and reliability of the data validation process and leads to more trustworthy climate data for analysis and decision-making.
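As an illustration of how such parameters might be derived before the tests are enabled, the sketch below computes provisional range-test limits from historical observations using climatological percentiles. The percentile levels and margin are placeholder assumptions that operators and climatologists would review and adjust.

```python
import numpy as np

def derive_range_thresholds(historical_values, lower_pct=0.1, upper_pct=99.9, margin=2.0):
    """Suggest provisional range-test limits from historical data.

    historical_values: 1-D array of past observations for one station and element.
    The percentile levels and margin are placeholders; the final limits would be
    reviewed and adjusted by QC operators and climatologists.
    """
    values = np.asarray(historical_values, dtype=float)
    values = values[~np.isnan(values)]          # ignore missing observations
    lower = np.percentile(values, lower_pct) - margin
    upper = np.percentile(values, upper_pct) + margin
    return lower, upper
```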
These tests map onto the subsections of WMO publication No. 1131, section 5.3, as follows:
5.3.1.1 Consistency checks - Relational Comparison Test, Diurnal Test and Contextual Consistency Test.
5.3.1.2 Data comparison - Remote Sensing Consistency Test.
5.3.1.3 Heuristic checks - These should be done first to define parameters for the tests. Continuous evaluation should also be carried out to monitor the need for parameter changes caused by local environmental changes or climate change.
5.3.1.4 Statistical checks - Flat Line Test, Spike Test and Repeated Value Test (see the sketch after this list).
5.3.1.5 Spatial checks - Spatial Consistency Test.
5.3.1.6 Data recovery - All the tests will have a user interface that allows for data corrections.
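To illustrate the statistical checks in 5.3.1.4, here is a minimal sketch of a spike test and a flat line test over an ordered series of values. The thresholds are assumed for the example and would come from the heuristic evaluations described above.

```python
def spike_test(values, max_jump):
    """Flag indices where the change from the previous value exceeds max_jump."""
    return [i for i in range(1, len(values))
            if abs(values[i] - values[i - 1]) > max_jump]

def flat_line_test(values, window, tolerance=0.0):
    """Flag indices where the value has stayed (near) constant for `window` steps."""
    flagged = []
    for i in range(window, len(values)):
        segment = values[i - window:i + 1]
        if max(segment) - min(segment) <= tolerance:
            flagged.append(i)
    return flagged

# Example: hourly temperatures with a suspicious jump and a stuck sensor.
temps = [21.0, 21.4, 35.0, 21.6, 22.0, 22.0, 22.0, 22.0, 22.0]
print(spike_test(temps, max_jump=10))    # -> [2, 3]
print(flat_line_test(temps, window=4))   # -> [8]
```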
We could add the following tests as products in the products module.
Difference Threshold Test:
Example: Verifying that the difference between maximum and minimum temperature within a day is realistic, such as not exceeding 40°C (see the sketch after this list).
Summation Threshold Test:
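A minimal sketch of the Difference Threshold Test described above; the 40°C limit comes from the example and would in practice be set per station and climate.

```python
def difference_threshold_test(tmax, tmin, max_range=40.0):
    """Return True if the daily temperature range is plausible.

    Flags the pair when tmin exceeds tmax or when the range exceeds max_range;
    the 40 degC default is illustrative and would be tuned per station/climate.
    """
    if tmax < tmin:
        return False
    return (tmax - tmin) <= max_range

# Example usage
print(difference_threshold_test(34.2, 18.7))  # True  (range 15.5 degC)
print(difference_threshold_test(52.0, 5.0))   # False (range 47.0 degC)
```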
Overview
After reviewing the WMO CDMS specifications, I suggest developing the following quality control (QC) submodules to enhance our climate data management system:
Duplicate Data Check: To eliminate duplicate entries during data ingestion, preventing unnecessary redundancy (a sketch of this and the Limits Check follows this list).
Limits Check: During data ingestion, values outside the acceptable range will be flagged for review.
Source Check: To differentiate and validate identical data from various sources, designating the most reliable source as final.
Missing Data Check: To detect data gaps, facilitating informed decisions on handling these absences for subsequent analysis.
Internal Consistency Check: To verify the coherence of related data points within the dataset, such as temperature and dew point correlations. This check will include same-value, jump-value and inter-element checks.
Temporal Consistency Check: To identify abrupt temporal changes, distinguishing between potential errors and actual environmental shifts.
Spatial Consistency Check: To assess data across various locations, identifying spatial anomalies that may indicate localized discrepancies.
Extreme Value Check: To scrutinize and authenticate any extreme values or statistical outliers beyond the normal range.
Data Homogeneity Check: To correct biases from changes in observational methods or locations, especially vital for long-term climate studies.
Metadata Check: To investigate metadata for additional insights that may elucidate detected anomalies or inconsistencies.
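As a sketch of how the ingestion-time checks might work together, the hypothetical ingest function below applies the Duplicate Data Check and the Limits Check to one incoming record. The record fields, key structure and return values are assumptions for illustration only.

```python
def ingest(record, existing_keys, limits):
    """Apply ingestion-time QC to one incoming record.

    record: dict with station_id, element_id, datetime, value.
    existing_keys: set of (station_id, element_id, datetime) already stored.
    limits: dict mapping element_id -> (lower, upper) acceptable range.
    Returns a QC status string; a real implementation would also log the outcome.
    """
    key = (record["station_id"], record["element_id"], record["datetime"])
    if key in existing_keys:
        # Duplicate Data Check: reject redundant entries at ingestion time.
        return "rejected_duplicate"

    existing_keys.add(key)
    lower, upper = limits.get(record["element_id"], (float("-inf"), float("inf")))
    if not (lower <= record["value"] <= upper):
        # Limits Check: store the value but flag it for review.
        return "flagged_out_of_range"
    return "accepted"
```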
I recommend constructing a QC workflow that processes these checks in a logical and efficient sequence, starting with simpler tasks and advancing to more complex analyses. While some steps may occur concurrently, the overall process should be iterative, ensuring a comprehensive and nuanced data quality assessment.
Furthermore, each QC step will be systematically logged in the observation model, which is specifically designed to accommodate these checks, enhancing transparency and traceability in data quality control.
Some of these checks could be user-driven (manual), system-driven (automated) or semi-automated, depending on the nature of the quality control check.
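As a sketch of how such a workflow might be sequenced, logged and tagged by execution mode, the snippet below registers checks in order and appends each outcome to the observation's QC log. The check names, modes and field names are assumptions, not the final observation model.

```python
from datetime import datetime, timezone

# Ordered registry: simpler, automated checks first, more complex or manual ones later.
QC_PIPELINE = [
    ("duplicate_check",            "automated"),
    ("limits_check",               "automated"),
    ("internal_consistency_check", "semi_automated"),
    ("temporal_consistency_check", "automated"),
    ("spatial_consistency_check",  "semi_automated"),
    ("extreme_value_check",        "manual"),
]

def run_qc(observation, checks):
    """Run each registered check in order and log its result on the observation.

    `checks` maps a check name to a callable returning (passed, detail).
    The qc_log list mirrors the idea of recording every QC step in the
    observation model for transparency and traceability.
    """
    observation.setdefault("qc_log", [])
    for name, mode in QC_PIPELINE:
        check = checks.get(name)
        if check is None:
            continue                      # check not implemented yet
        passed, detail = check(observation)
        observation["qc_log"].append({
            "check": name,
            "mode": mode,
            "passed": passed,
            "detail": detail,
            "checked_at": datetime.now(timezone.utc).isoformat(),
        })
    return observation
```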
Request for Comments
I invite feedback on this proposal. Your insights and suggestions will be invaluable.