NIEHS / beethoven

BEETHOVEN is: Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
https://niehs.github.io/beethoven/

AQS data: Explore metadata and data quality #7

Closed MAKassien closed 1 year ago

MAKassien commented 1 year ago
  1. What metadata is provided in the AQS file?
  2. Is there metadata pertaining to data quality? (Crossover with issue #8)
sigmafelix commented 1 year ago

This issue is closely related to what I am working on, so I will assign myself to it. A document I will share here includes a "glossary" that covers details on data quality in the metadata. The list of metadata to explore is below:

sigmafelix commented 1 year ago

I put up the glossary part here for everyone's quick reference.

Measurement methods

EPA AQS data terminology

Measurement terms

Quality terms

sigmafelix commented 1 year ago

I will list studies on missing-data treatment (related to #8). Papers listed here will be added to the reference list in #10.

Focus

Table

| Study | Missing treatment | POC | Spatial/temporal |
| --- | --- | --- | --- |
| Aguilera et al. 2023 | Imputed with missRanger | Not described | Manually selected "regional representative site" and included them in the performance assessment |
| Meng et al. 2018 | Not described | Not applicable (PM2.5 speciation data) | "Spatial" cross-validation based on 10-fold random split (actually not a spatial CV); daily split for temporal assessment |
| Zhang et al. 2017 | Monitors with 20+% coverage were selected; no "coverage" definition was provided | Not described | Exploratory study; four zones (the Northwest, the Northeast, the Southeast, and California) and AQS designated urban-rural classification |
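
The note that a 10-fold random split is "actually not a spatial CV" can be illustrated with a small sketch. It is a hypothetical, standard-library-only Python example, not code from any of the listed studies: the records, site names, and function names are made up for illustration. It contrasts a record-level random split, where the same site can end up in both training and test folds, with a split grouped by site. Grouping by site is only one simple way to reduce that overlap; blocking by spatial region would be stricter.

```python
import random

def random_folds(records, k, seed=0):
    """Assign each record to one of k folds at random (record-level split)."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in records]

def site_grouped_folds(records, k):
    """Assign folds by site, so all records of a site share one fold."""
    sites = sorted({site for site, _ in records})
    fold_of_site = {site: i % k for i, site in enumerate(sites)}
    return [fold_of_site[site] for site, _ in records]

def shared_sites(records, folds, test_fold):
    """Sites that appear in both the test fold and the training folds."""
    test_sites = {s for (s, _), f in zip(records, folds) if f == test_fold}
    train_sites = {s for (s, _), f in zip(records, folds) if f != test_fold}
    return test_sites & train_sites

# Hypothetical data: 5 sites with 20 daily records each, as (site_id, day).
records = [(f"site_{i}", day) for i in range(5) for day in range(20)]

random_assignment = random_folds(records, k=10)
grouped_assignment = site_grouped_folds(records, k=5)

# Under the grouped split, no site is shared between test and training data.
grouped_overlap = [shared_sites(records, grouped_assignment, f) for f in range(5)]
# Under the random split, sites are typically shared across folds.
random_overlap = [shared_sites(records, random_assignment, f) for f in range(10)]
```

With a record-level split, a model can be evaluated on days from sites it has already seen, so the result says little about prediction at unmonitored locations.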
sigmafelix commented 1 year ago

My latest comment in #8 describes metadata and data quality through a glossary and a workflow. I will add details on negative values (below the minimum detection limit) from the daily data.

sigmafelix commented 1 year ago

Negative values for PM2.5 data in 1,060 sites during the study period:

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean   sd p0 p25 p50   p75 p100 hist 
1 p_negative            0             1 0.393 1.68  0   0   0 0.106 26.9 ▇▁▁▁▁

332 sites reported negative values in the Arithmetic.Mean field. Across all sites, the mean percentage of negative values is 0.39 percent and the maximum is 26.90 percent. Sites with nonzero percentages are distributed from the Great Plains to the Mountain West, in the inland Northeast, and along the California coast.
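
As a rough illustration of how a per-site percentage such as `p_negative` could be derived, here is a minimal Python sketch using only the standard library. It is not the code that produced the summary above (that output looks like it came from R's skimr); the function name, record layout, and sample values are hypothetical.

```python
from collections import defaultdict

def percent_negative_by_site(records):
    """Return {site_id: percent of daily values that are below zero}."""
    totals = defaultdict(int)
    negatives = defaultdict(int)
    for site_id, value in records:
        totals[site_id] += 1
        if value < 0:
            negatives[site_id] += 1
    return {site: 100.0 * negatives[site] / totals[site] for site in totals}

# Hypothetical daily records as (site_id, Arithmetic.Mean) pairs.
records = [
    ("site_A", 5.2), ("site_A", -0.4), ("site_A", 3.1), ("site_A", 7.8),
    ("site_B", 2.0), ("site_B", 4.5),
]

p_negative = percent_negative_by_site(records)
# Count of sites that reported at least one negative value.
sites_with_negatives = sum(1 for p in p_negative.values() if p > 0)
```

Summarizing the values of `p_negative` across sites (mean, quantiles, maximum) would give a table of the same shape as the one above.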

[Image: negative]

sigmafelix commented 1 year ago

The pattern in the proportion of negative values makes sense, in that negative values are associated with measurements "below the minimum detection limit." The negative values fall in the range [-8.000, -0.004], which lies within the acceptable range under the AQS convention (-15 ug/m3), and most are in the [-2, 0] range. As each monitor has different device specifications, converting negative values into a single value might introduce bias at some sites (i.e., BAM sites). I think this is a discussion point for the next meeting.
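
To make the concern about single-value conversion concrete, the sketch below (Python, standard library only) summarizes the negative values at a site and measures how much the site mean shifts when every negative is replaced by one constant. All names and sample values are hypothetical. Reading -15 ug/m3 as a lower limit is an assumption based on the comment above, and replacing with 0 is just one possible choice, not a treatment the thread has agreed on.

```python
from statistics import mean

# Assumption: -15 ug/m3 is treated as the lowest acceptable value.
AQS_LOWER_LIMIT = -15.0

def summarize_negatives(values):
    """Count negative values, report their range, and check the lower limit."""
    negatives = [v for v in values if v < 0]
    return {
        "n_negative": len(negatives),
        "min": min(negatives) if negatives else None,
        "max": max(negatives) if negatives else None,
        "within_limit": all(v >= AQS_LOWER_LIMIT for v in negatives),
    }

def replacement_shift(values, replacement=0.0):
    """Change in the site mean when each negative value becomes `replacement`."""
    replaced = [replacement if v < 0 else v for v in values]
    return mean(replaced) - mean(values)

# Hypothetical sites: one with few negatives, one with many.
site_few = [4.0, 6.0, -0.5, 5.0]
site_many = [-2.0, -1.0, 1.0, 2.0]

summary_many = summarize_negatives(site_many)
shift_few = replacement_shift(site_few)
shift_many = replacement_shift(site_many)
```

In this toy case the site with more (and larger) negative values sees a bigger upward shift in its mean, which is the kind of site-dependent bias raised above.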
