Closed MAKassien closed 1 year ago
This issue is closely related to what I am doing. I will assign myself to this issue. A document I will share here includes "glossary," which covers details on data quality in metadata. A list of metadata to explore is as below:
- RAQSAPI (in R) or direct POST calls: this data includes a column about the exact sampling frequency of monitors. Although we can derive the sampling frequency from the annual summary (i.e., ceiling((the required number of samples) / 365)), I think it should be cross-checked with the daily summary.
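The derivation and cross-check above could be sketched like this (the field names and the hourly-monitor example are my assumptions, not verified AQS fields):

```r
# Sketch of the cross-check; column names are assumptions, not actual AQS fields.
# Annual summary: the required number of samples implies samples per day.
required_samples <- 8760                          # e.g., an hourly monitor
freq_annual <- ceiling(required_samples / 365)    # 24 samples per day

# Daily summary: the observed records per site-date should agree with that.
obs_per_day <- rep(24, 10)    # simulated per-day observation counts
freq_daily <- max(obs_per_day)

freq_annual == freq_daily     # TRUE if the two sources agree
```

For a 1-in-6-day sampler the annual formula collapses to 1 sample per day, which is why the daily summary is needed to recover the actual interval.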
- RAQSAPI or direct POST calls using httr (or the latest httr2) do not work; they fail with the error message "unsafe legacy renegotiation is disabled", which occurs on non-Windows (or Unix-like?) systems (link). I will try another way to automate getting hourly samples from the API.

I put up the glossary part here for everyone's quick reference.
SS-CCC-NNNN-PPPPP-Q
- where SS is state FIPS, CCC is county FIPS, NNNN is site number, PPPPP is parameter code, and Q is POC.
- R code example:
sprintf("%02d-%03d-%04d-%05d-%01d", state, county, site, parameter, poc)
given that all inputs are integers
- I think the hyphens are unnecessary; I would like to ask for everyone's thoughts
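On the hyphen question, a quick sketch showing that the two forms carry the same information (the FIPS/parameter values below are made-up examples):

```r
# Hyphenated vs. compact monitor ID; example input values are made up.
state <- 6L; county <- 37L; site <- 4004L; parameter <- 88101L; poc <- 1L

id_hyphen  <- sprintf("%02d-%03d-%04d-%05d-%01d", state, county, site, parameter, poc)
id_compact <- sprintf("%02d%03d%04d%05d%01d", state, county, site, parameter, poc)

id_hyphen                                # "06-037-4004-88101-1"
gsub("-", "", id_hyphen) == id_compact   # TRUE: hyphens are purely cosmetic
```

Since every field is zero-padded to a fixed width, the compact form is unambiguous and the hyphens only help human readability.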
I will list studies on the treatment of missing data (related to #8 ). Papers listed here will be added to the reference list in #10 .
Study | Missing treatment | POC | Spatial/temporal |
---|---|---|---|
Aguilera et al. 2023 | Imputed with missRanger | Not described | Manually selected "regional representative site" and included them in the performance assessment
Meng et al. 2018 | Not described | Not applicable (PM2.5 speciation data) | "Spatial" cross-validation based on 10-fold random split (actually not a spatial CV); daily split for temporal assessment |
Zhang et al. 2017 | Monitors with 20+% coverage were selected; no "coverage" definition was provided | Not described | Exploratory study; four zones (the Northwest, the Northeast, the Southeast, and California) and AQS designated urban-rural classification |
My latest comment in #8 describes metadata and data quality through a glossary and workflow. I will add details on negative values (below the minimum detection limit) from the daily data.
Negative values for PM2.5 data in 1,060 sites during the study period:
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 p_negative 0 1 0.393 1.68 0 0 0 0.106 26.9 ▇▁▁▁▁
332 sites reported negative values in the Arithmetic.Mean field. The mean proportion is 0.39 percent and the maximum is 26.90 percent. Nonzero proportions are distributed from the Great Plains to the Mountain West, the inland Northeast, and the California coast.
The pattern of the proportions of negative values makes sense in that negative values are associated with measurements below the minimum detection limit. Negative values fall in the range [-8.000, -0.004], which, of course, lies within the acceptable range under AQS convention (down to -15 ug/m3). Most negative values are in the [-2, 0] range. As each monitor has different device specifications, converting negative values into a single value might introduce bias at some sites (i.e., BAM sites). I think this is a discussion point for the next meeting.
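As a rough sketch of how per-site proportions like these can be tallied (not the exact workflow used above; `daily` and its columns are hypothetical stand-ins for the AQS daily summary):

```r
# Hypothetical stand-in for the AQS daily summary; column names assumed.
daily <- data.frame(
  site_id = c("01-001-0001", "01-001-0001", "01-003-0002", "01-003-0002"),
  Arithmetic.Mean = c(5.2, -0.3, 7.1, 6.4)
)

# Percent of daily records per site with a negative Arithmetic.Mean
p_negative <- aggregate(
  Arithmetic.Mean ~ site_id, data = daily,
  FUN = function(x) 100 * mean(x < 0)
)
names(p_negative)[2] <- "p_negative"
p_negative  # site 01-001-0001 -> 50, site 01-003-0002 -> 0
```

A table of this shape is what the numeric summary above was computed from.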