davidcarslaw / openair

Tools for air quality data analysis
https://davidcarslaw.github.io/openair/
GNU General Public License v2.0

Switch to Position-Based Percentile Calculation for Regulatory Compliance | timeAverage #396

Open marcelooyaneder opened 2 months ago

marcelooyaneder commented 2 months ago

Description:

This pull request updates the percentile calculation to comply with regulatory requirements. Previously, percentiles were calculated by interpolation, which can produce intermediate values that do not appear in the observed data. Such values are not permitted in regulatory calculations.

This change improves accuracy and ensures that the calculations align with the standards required for regulatory use.

Key Changes:

  1. Switched from interpolation-based percentile calculation to position-based calculation.
  2. Removed any logic that generates intermediate values.
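
To illustrate why interpolation is a problem here, this is a minimal Python sketch (not code from this pull request). The function name `percentile_interpolated` and the sample data are hypothetical, and the formula assumes linear interpolation between neighbouring sorted values, which is the default in R's `quantile` (type 7).

```python
import math

def percentile_interpolated(values, q):
    """Percentile by linear interpolation between neighbouring sorted values.

    Assumes the common default scheme (R quantile type 7); for illustration only.
    """
    xs = sorted(values)
    h = (len(xs) - 1) * q          # fractional 0-based position in the sorted list
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)  # clamp at the last element
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

# Hypothetical daily concentrations, for illustration only.
daily = [10.0, 20.0, 30.0, 40.0, 50.0]
p98 = percentile_interpolated(daily, 0.98)
print(p98)           # about 49.2, between the two largest observations
print(p98 in daily)  # False: this value was never measured
```

The result lies between two observations, which is exactly the kind of intermediate value a position-based method avoids.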

Please review and let me know if any additional adjustments are needed.

I made this pull request against the master branch because I didn't see any other appropriate branch. (Tested and working in my environments.)

davidcarslaw commented 2 months ago

Thanks for this suggestion and apologies for the delay in responding (I was on holiday). This is an issue I have not looked at closely but can see how the method used will matter. Do you have a source / link for the preferred method to use, as I'm not familiar with that (at least in the UK)?

All the best,
David

marcelooyaneder commented 2 months ago

Hello David, I hope you are doing well. This requirement comes from Chilean regulations (based on USEPA), which detail the procedure for calculating percentiles. I am attaching the link (Chilean Regulation). As you would expect, it is in Spanish, so here is a translation:

"To calculate the percentile, all values of the PM10 respirable particulate concentrations will be listed in ascending order: X1 ≤ X2 ≤ X3 ≤ ... ≤ Xk ≤ ... ≤ Xn-1 ≤ Xn. The k-th percentile will be the value of the element of rank 'k,' where 'k' is calculated using the following formula: k = q * n, where 'q' = 0.98, and 'n' corresponds to the total number of data points in the ordered list. The value of 'k' will be rounded to the nearest integer."

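The quoted procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in this pull request: the function name `percentile_by_rank` is hypothetical, ties in "rounded to the nearest integer" are assumed to round half up, and the rank is clamped to the range 1..n as a safety guard.

```python
import math

def percentile_by_rank(values, q=0.98):
    """Return the observed value at rank k = round(q * n), with ranks starting at 1."""
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("no data")
    # Round half up (assumption); Python's built-in round() rounds halves to even.
    k = math.floor(q * n + 0.5)
    k = min(max(k, 1), n)  # keep the rank inside 1..n (guard, assumption)
    return xs[k - 1]

# 365 hypothetical daily values: 1.0, 2.0, ..., 365.0
daily = [float(v) for v in range(1, 366)]
print(percentile_by_rank(daily))  # 358.0, since k = round(0.98 * 365) = round(357.7) = 358
```

Because the function returns an element of the sorted list, the result is always one of the observed values.
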
Given this, I searched for the original source at the EPA and found the following reference (EPA Regulation). Section 5 contains an update to the regulation: percentiles are still calculated by rank, but the position is based on the number of valid records (I could implement this if you would like).
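
As a rough Python sketch of what basing the position on valid records could look like: here a "valid" record is assumed to be simply a non-missing value, and the rank uses the same rounding as the Chilean text. The EPA's exact rank rule and validity criteria are not reproduced here, and the function name `rank_from_valid_records` is hypothetical.

```python
import math

def rank_from_valid_records(values, q=0.98):
    """Return (n_valid, k, value), ranking only the non-missing records."""
    # Assumption: a record is valid if it is neither None nor NaN.
    valid = sorted(v for v in values if v is not None and not math.isnan(v))
    n_valid = len(valid)
    if n_valid == 0:
        raise ValueError("no valid records")
    # Round half up and clamp to 1..n_valid (assumptions for this sketch).
    k = min(max(math.floor(q * n_valid + 0.5), 1), n_valid)
    return n_valid, k, valid[k - 1]

# 365 hypothetical days, 15 of them missing (None).
series = [float(v) for v in range(1, 351)] + [None] * 15
print(rank_from_valid_records(series))  # (350, 343, 343.0)
```

With all 365 days counted the rank would be 358; with the 350 valid records it is 343, so the choice of n changes which observation is reported.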

In addition, this form of calculation is quite common: it is also the method that air quality numerical simulation software uses to calculate percentiles. Here is a reference (CALPUFF View Percentiles).

Additionally, I found the following text in section 2.5.2.1 of the WHO air quality guidelines (https://iris.who.int/bitstream/handle/10665/345329/9789240034228-eng.pdf):

"In keeping with established practice, as a starting point, short-term AQG levels were considered by the GDG as the 99th percentiles of daily concentrations empirically observed in distributions with a mean equal to the long-term AQG level." Here it is explicitly stated that the data must be empirically observed.