Closed MAKassien closed 1 year ago
This issue is closely related to what I am doing. I will assign myself to this issue. A document I will share here includes "glossary," which covers details on data quality in metadata. A list of metadata to explore is as below:
- RAQSAPI (in R) or direct POST calls: this data includes a column about the exact sampling frequency of monitors. Although we can derive the sampling frequency from the annual summary (i.e., ceiling((the required number of samples) / 365)), I think it should be cross-checked with the daily summary.
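The derivation and cross-check above could be sketched like this (the field names and the hourly-monitor example are my assumptions, not verified AQS fields):

```r
# Sketch of the cross-check; column names are assumptions, not actual AQS fields.
# Annual summary: the required number of samples implies samples per day.
required_samples <- 8760                          # e.g., an hourly monitor
freq_annual <- ceiling(required_samples / 365)    # 24 samples per day

# Daily summary: the observed records per site-date should agree with that.
obs_per_day <- rep(24, 10)    # simulated per-day observation counts
freq_daily <- max(obs_per_day)

freq_annual == freq_daily     # TRUE if the two sources agree
```

For a 1-in-6-day sampler the annual formula collapses to 1 sample per day, which is why the daily summary is needed to recover the actual interval.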
- RAQSAPI or direct POST calls using httr (or the latest httr2) do not work; they fail with the error message "unsafe legacy renegotiation is disabled", which occurs on non-Windows (or Unix-like?) systems (link). I will try another way to automate getting hourly samples from the API.

I put up the glossary part here for everyone's quick reference.
SS-CCC-NNNN-PPPPP-Q
- where SS is state FIPS, CCC is county FIPS, NNNN is site number, PPPPP is parameter code, and Q is POC.
- R code example:
sprintf("%02d-%03d-%04d-%05d-%01d", state, county, site, parameter, poc)
given that all inputs are integers
- I think the hyphens are unnecessary; I would like to ask for everyone's thoughts
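On the hyphen question, a quick sketch showing that the two forms carry the same information (the FIPS/parameter values below are made-up examples):

```r
# Hyphenated vs. compact monitor ID; example input values are made up.
state <- 6L; county <- 37L; site <- 4004L; parameter <- 88101L; poc <- 1L

id_hyphen  <- sprintf("%02d-%03d-%04d-%05d-%01d", state, county, site, parameter, poc)
id_compact <- sprintf("%02d%03d%04d%05d%01d", state, county, site, parameter, poc)

id_hyphen                                # "06-037-4004-88101-1"
gsub("-", "", id_hyphen) == id_compact   # TRUE: hyphens are purely cosmetic
```

Since every field is zero-padded to a fixed width, the compact form is unambiguous and the hyphens only help human readability.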
I will list studies on the treatment of missing data (related to #8 ). Papers listed here will be added to the reference list in #10 .
Study | Missing treatment | POC | Spatial/temporal |
---|---|---|---|
Aguilera et al. 2023 | Imputed with missRanger | Not described | Manually selected "regional representative site" and included them in the performance assessment
Meng et al. 2018 | Not described | Not applicable (PM2.5 speciation data) | "Spatial" cross-validation based on 10-fold random split (actually not a spatial CV); daily split for temporal assessment |
Zhang et al. 2017 | Monitors with 20+% coverage were selected; no "coverage" definition was provided | Not described | Exploratory study; four zones (the Northwest, the Northeast, the Southeast, and California) and AQS designated urban-rural classification |
My latest comment in #8 describes metadata and data quality through a glossary and workflow. I will add details on negative values (below the minimum detection limit) from the daily data.
Negative values for PM2.5 data in 1,060 sites during the study period:
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 p_negative 0 1 0.393 1.68 0 0 0 0.106 26.9 ▇▁▁▁▁
332 sites reported negative values in the Arithmetic.Mean field. The mean proportion is 0.39 percent and the maximum is 26.90 percent. Nonzero proportions are distributed from the Great Plains to the Mountain West, the inland Northeast, and the California coast.
The pattern of the proportions of negative values makes sense in that negative values are associated with measurements below the minimum detection limit. Negative values fall in the range [-8.000, -0.004], which, of course, lies within the acceptable range under AQS convention (down to -15 ug/m3). Most negative values are in the [-2, 0] range. As each monitor has different device specifications, converting negative values into a single value might introduce bias at some sites (i.e., BAM sites). I think this is a discussion point for the next meeting.
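As a rough sketch of how per-site proportions like these can be tallied (not the exact workflow used above; `daily` and its columns are hypothetical stand-ins for the AQS daily summary):

```r
# Hypothetical stand-in for the AQS daily summary; column names assumed.
daily <- data.frame(
  site_id = c("01-001-0001", "01-001-0001", "01-003-0002", "01-003-0002"),
  Arithmetic.Mean = c(5.2, -0.3, 7.1, 6.4)
)

# Percent of daily records per site with a negative Arithmetic.Mean
p_negative <- aggregate(
  Arithmetic.Mean ~ site_id, data = daily,
  FUN = function(x) 100 * mean(x < 0)
)
names(p_negative)[2] <- "p_negative"
p_negative  # site 01-001-0001 -> 50, site 01-003-0002 -> 0
```

A table of this shape is what the numeric summary above was computed from.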