ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development
BSD 3-Clause "New" or "Revised" License

Add CMAQ model bias/error info to Exposures API #93

Closed rayi113 closed 6 years ago

rayi113 commented 7 years ago

Per the Oct 2017 Data Quality hackathon breakout group, add CMAQ model bias/error info to the Exposures API as an indicator of the data quality of the CMAQ model.

mjstealey commented 7 years ago

Question: Would it make sense to just serve all of the CMAQ output data instead of iteratively assembling components piecemeal?

arunacs commented 7 years ago

Good question. Worth a quick discussion on this before implementation.

mjstealey commented 7 years ago

@arunacs - Added an overview of the CMAQ variables to discuss.

Google: CMAQ - variables all

Format is CSV:

var_name,data_type,long_name,display_name,units,var_desc,notes

Where display_name and notes are presently blank and should define what we want to do, or how the exposure variables may interplay (quality metrics, etc.)
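As a rough illustration of how the overview sheet could be checked for rows that still need input, here is a minimal sketch. It assumes the sheet is exported as CSV with the header shown above; the function name `missing_fields` is made up for this example, and the two sample rows reuse the XYL and PM25RD metadata from the netCDF dump further down this thread.

```python
import csv
import io

# Sample rows in the column layout described above. display_name and notes
# are left blank, as they currently are in the sheet.
SAMPLE = """var_name,data_type,long_name,display_name,units,var_desc,notes
XYL,float32,XYL,,ppbV,1000.0*XYL[1],
PM25RD,float32,PM25RD,,ug/m3,AECIJ[0]+APOCIJ[0]+0.01*ASO4IJ[0],
"""


def missing_fields(csv_text, fields=("display_name", "notes")):
    """Return {var_name: [blank field names]} for rows that still need input."""
    todo = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        blank = [f for f in fields if not (row.get(f) or "").strip()]
        if blank:
            todo[row["var_name"]] = blank
    return todo


print(missing_fields(SAMPLE))
```

Running this against the real export would give a to-do list of variables whose display_name or notes are still empty.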


The proposal is pretty straightforward: instead of adding CMAQ data piecemeal as we have been, we just serve it all.

To do so we’d need a better understanding of the variable names and how they interplay with each other. By interplay I mean: are some variables best served in combination with each other, based on their meaning? For example, if a PM25 request is made and we also have a PM25 quality metric, we return both pieces of information together.

What would be needed is:

  1. The common name of these things as display_name
  2. Interplay with other variables in the notes section
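To make the "interplay" idea concrete, here is a small sketch of how a request for one variable could be expanded to include related variables. Everything in it is an assumption for illustration: the variable name `PM25_quality`, the `;` separator inside notes, and the function `expand_request` are not defined anywhere in this thread.

```python
# Hypothetical metadata: the notes field lists related variables, separated
# by ';' (an assumed convention, not something fixed in this thread).
NOTES = {
    "PM25": "PM25_quality",
    "PM25_quality": "",
}


def expand_request(var_name, notes=NOTES):
    """Return the requested variable plus any variables its notes point to."""
    related = [n.strip() for n in notes.get(var_name, "").split(";") if n.strip()]
    return [var_name] + related


print(expand_request("PM25"))  # the request plus its related quality metric
```

With metadata like this in place, the service could decide which columns to return together without hard-coding the pairings.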

We can then formulate what it would take to create a dedicated CMAQ RESTful service.

Extend a step further, and the Exposures API would use the CMAQ service for those components.

lstillwe commented 7 years ago

@arunacs - is this list of variables consistent across the 2010 and 2011 CMAQ datasets? Do the 2010 and 2011 contain the same number of variables and do they all have the same names?

Thanks - Lisa

mjstealey commented 7 years ago

is this list of variables consistent across the 2010 and 2011 CMAQ datasets?

@lstillwe - Nope, they are different.

Will produce something to show this better.

mjstealey commented 7 years ago

@lstillwe - Better illustration of 2010 vs 2011:

2010 NVARS:  72
2010 Vars:  ALD2            ALDX            CO              ETH             ETHA            FORM            H2O2            HNO3            HNO3_UGM3       HONO            HOX             IOLE            ISOP            N2O5            NH3             NH3_UGM3        NHX             NO              NO2             ANO3_PPB        NOY             NTR             O3              OLE             PAR             PAN             PANX            SO2             SO2_UGM3        SULF            TERP            TOL             VOC             XYL             AFEJ            AALJ            ASIJ            ATIJ            ACAJ            AMGJ            AKJ             AMNJ            ASOILJ          ANAK            AMGK            AKK             ACAK            ACLIJ           AECIJ           ANAIJ           ANO3IJ          ANO3K           ANH4IJ          ANH4K           AOCIJ           AOMIJ           AORGAJ          AORGBJ          AORGCJ          APOCIJ          APOAIJ          ASO4IJ          ASO4K           ATOTI           ATOTJ           ATOTK           PMIJ            PM10            AUNSPEC1IJ      ANCOMIJ         AUNSPEC2IJ      PM25RD

2011 NVARS:  122
2011 Vars:  ALD2            ALDX            BENZENE         CO              ETH             ETHA            FORM            H2O2            HNO3            HNO3_UGM3       HONO            CLNO2           HOX             OH              IOLE            ISOP            N2O5            NH3             NH3_UGM3        NHX             NO              NO2             ANO3_PPB        NTR             PANS            NOY             O3              OLE             PAR             PAN             PANX            SO2             SO2_UGM3        SULF            TERP            TOL             VOC             XYLMN           AFEJ            AALJ            ASIJ            ATIJ            ACAJ            AMGJ            AKJ             AMNJ            ASOILJ          AHPLUSIJ        ANAK            AMGK            AKK             ACAK            ACLIJ           AECIJ           ANAIJ           ANO3IJ          ANO3K           TNO3            ANH4IJ          ANH4K           AOCIJ           AOMIJ           AORGAJ          AORGBJ          AORGCJ          APOCIJ          APOAIJ          ASO4IJ          ASO4K           ATOTI           ATOTJ           ATOTK           PMIJ            PM10            AUNSPEC1IJ      ANCOMIJ         AUNSPEC2IJ      AOMOCRAT_PRI    AOMOCRAT_TOT    PM25_HP         PM25_CL         PM25_EC         PM25_NA         PM25_MG         PM25_K          PM25_CA         PM25_NH4        PM25_NO3        PM25_OC         PM25_SOIL       PM25_SO4        PM25_TOT        PM25_UNSPEC1    PMC_CL          PMC_NA          PMC_NH4         PMC_NO3         PMC_SO4         PMC_TOT         DCV_Recon       AIR_DENS        RH              SFC_TMP         PBLH            SOL_RAD         precip          WSPD10          WDIR10          K               P1              P2              P3              a               K_prime         sqrt_Ki         max_NO3_loss    PM25_NO3_loss   ANO3IJ_loss     PM25_NH4_loss   ANH4IJ_loss     PMIJ_FRM        PM25_FRM

VAR: XYL in 2010, not in 2011
<class 'netCDF4._netCDF4.Variable'>
float32 XYL(TSTEP, LAY, ROW, COL)
    long_name: XYL
    units: ppbV
    var_desc: 1000.0*XYL[1]
unlimited dimensions: TSTEP
current shape = (24, 1, 112, 148)
filling off

VAR: PM25RD in 2010, not in 2011
<class 'netCDF4._netCDF4.Variable'>
float32 PM25RD(TSTEP, LAY, ROW, COL)
    long_name: PM25RD
    units: ug/m3
    var_desc: AECIJ[0]+APOCIJ[0]+0.01*ASO4IJ[0]
unlimited dimensions: TSTEP
current shape = (24, 1, 112, 148)
filling off

Generated with:

from netCDF4 import Dataset

data2010 = 'CMAQ/2010/raw/CCTM_v502_with_CDC2010_Linux2_x86_64intel.ACONC.20100702.combine_base'
data2011 = 'CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_ed_emis_combine.aconc.01'

ds2010 = Dataset(data2010, 'r')
ds2011 = Dataset(data2011, 'r')

keys2010 = ds2010.variables.keys()
keys2011 = ds2011.variables.keys()

print('2010 NVARS: ', getattr(ds2010, 'NVARS'))
print('2010 Vars: ', getattr(ds2010, 'VAR-LIST'))
print('2011 NVARS: ', getattr(ds2011, 'NVARS'))
print('2011 Vars: ', getattr(ds2011, 'VAR-LIST'))

for key in keys2010:
    if key not in keys2011:
        print('VAR:', key, 'in 2010, not in 2011')
        print(ds2010.variables[key])

lstillwe commented 7 years ago

@mjstealey @arunacs Yes - been looking too: found following list in common between the two. Do we want just these? AALJ ACAJ ACAK ACLIJ AECIJ AFEJ AKJ AKK ALD2 ALDX AMGJ AMGK AMNJ ANAIJ ANAK ANCOMIJ ANH4IJ ANH4K ANO3IJ ANO3K ANO3_PPB AOCIJ AOMIJ AORGAJ AORGBJ AORGCJ APOAIJ APOCIJ ASIJ ASO4IJ ASO4K ASOILJ ATIJ ATOTI ATOTJ ATOTK AUNSPEC1IJ AUNSPEC2IJ CO ETH ETHA FORM H2O2 HNO3 HNO3_UGM3 HONO HOX IOLE ISOP N2O5 NH3 NH3_UGM3 NHX NO NO2 NOY NTR O3 OLE PAN PANX PAR PM10 PMIJ SO2 SO2_UGM3 SULF TERP TOL VOC
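For reference, a common list like this can be derived with plain set operations. The sketch below uses abbreviated lists taken from the VAR-LIST output earlier in this thread rather than the real files; the helper name `compare_var_lists` is made up for this example.

```python
def compare_var_lists(vars_2010, vars_2011):
    """Split two variable lists into common and year-specific names."""
    a, b = set(vars_2010), set(vars_2011)
    return {
        "common": sorted(a & b),
        "only_2010": sorted(a - b),
        "only_2011": sorted(b - a),
    }


# Abbreviated lists for illustration; the real ones come from VAR-LIST.
result = compare_var_lists(
    ["CO", "O3", "XYL", "PM25RD"],
    ["CO", "O3", "XYLMN", "BENZENE"],
)
print(result)
```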

mjstealey commented 7 years ago

Do we want just these?

Would rather get them all. The enforcement of what is available at what time would be done via the API.
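A minimal sketch of what that enforcement could look like, assuming the API keeps a per-year set of variable names. The mapping below is abbreviated, and the names `AVAILABLE_BY_YEAR` and `check_request` are made up for this example; they are not part of the existing Exposures API.

```python
# Abbreviated per-year availability, based on the 2010 vs 2011 comparison
# above (XYL and PM25RD are 2010-only; XYLMN and BENZENE are 2011-only).
AVAILABLE_BY_YEAR = {
    2010: {"O3", "XYL", "PM25RD"},
    2011: {"O3", "XYLMN", "BENZENE"},
}


def check_request(var_name, year, available=AVAILABLE_BY_YEAR):
    """Return (ok, message) for a variable/year request."""
    if year not in available:
        return False, f"no CMAQ data loaded for {year}"
    if var_name not in available[year]:
        years = sorted(y for y, names in available.items() if var_name in names)
        hint = f"; available for {years}" if years else ""
        return False, f"{var_name} is not available for {year}{hint}"
    return True, "ok"


print(check_request("O3", 2010))
print(check_request("XYL", 2011))
```

This way the table can hold the union of all variables, and the API layer reports which ones are actually populated for a given year.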

Full list:

['AALJ', 'ACAJ', 'ACAK', 'ACLIJ', 'AECIJ', 'AFEJ', 'AHPLUSIJ', 'AIR_DENS', 'AKJ', 'AKK', 'ALD2', 'ALDX', 
'AMGJ', 'AMGK', 'AMNJ', 'ANAIJ', 'ANAK', 'ANCOMIJ', 'ANH4IJ', 'ANH4IJ_loss', 'ANH4K', 'ANO3IJ', 
'ANO3IJ_loss', 'ANO3K', 'ANO3_PPB', 'AOCIJ', 'AOMIJ', 'AOMOCRAT_PRI', 'AOMOCRAT_TOT', 'AORGAJ', 
'AORGBJ', 'AORGCJ', 'APOAIJ', 'APOCIJ', 'ASIJ', 'ASO4IJ', 'ASO4K', 'ASOILJ', 'ATIJ', 'ATOTI', 'ATOTJ', 
'ATOTK', 'AUNSPEC1IJ', 'AUNSPEC2IJ', 'BENZENE', 'CLNO2', 'CO', 'DCV_Recon', 'ETH', 'ETHA', 'FORM', 
'H2O2', 'HNO3', 'HNO3_UGM3', 'HONO', 'HOX', 'IOLE', 'ISOP', 'K', 'K_prime', 'N2O5', 'NH3', 
'NH3_UGM3', 'NHX', 'NO', 'NO2', 'NOY', 'NTR', 'O3', 'OH', 'OLE', 'P1', 'P2', 'P3', 'PAN', 'PANS', 'PANX', 
'PAR', 'PBLH', 'PM10', 'PM25RD', 'PM25_CA', 'PM25_CL', 'PM25_EC', 'PM25_FRM', 'PM25_HP', 
'PM25_K', 'PM25_MG', 'PM25_NA', 'PM25_NH4', 'PM25_NH4_loss', 'PM25_NO3', 'PM25_NO3_loss', 
'PM25_OC', 'PM25_SO4', 'PM25_SOIL', 'PM25_TOT', 'PM25_UNSPEC1', 'PMC_CL', 'PMC_NA', 
'PMC_NH4', 'PMC_NO3', 'PMC_SO4', 'PMC_TOT', 'PMIJ', 'PMIJ_FRM', 'RH', 'SFC_TMP', 'SO2', 
'SO2_UGM3', 'SOL_RAD', 'SULF', 'TERP', 'TNO3', 'TOL', 'VOC', 'WDIR10', 'WSPD10', 'XYL', 'XYLMN', 'a', 
'max_NO3_loss', 'precip', 'sqrt_Ki']

lstillwe commented 7 years ago

@mjstealey Okay - Sounds good - I will just query both 2010 and 2011 datasets for their list of vars and get a union of that, in order to create the db table programmatically.

arunacs commented 7 years ago

The 2010 and 2011 datasets were produced somewhat independently, hence the inconsistency in the species list. The union approach proposed by @lstillwe sounds reasonable for now. When we meet next week, we can explore an option to condense this even further based upon the project needs.

mjstealey commented 7 years ago

query both 2010 and 2011 datasets for their list of vars and get a union of that, in order to create the db table programmatically

@lstillwe - We were apparently thinking the same thing!

from netCDF4 import Dataset

data2010 = 'CMAQ/2010/raw/CCTM_v502_with_CDC2010_Linux2_x86_64intel.ACONC.20100702.combine_base'
data2011 = 'CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_ed_emis_combine.aconc.01'

ds2010 = Dataset(data2010, 'r')
ds2011 = Dataset(data2011, 'r')

list2010 = str(getattr(ds2010, 'VAR-LIST')).split()
list2011 = str(getattr(ds2011, 'VAR-LIST')).split()
listall = list(set().union(list2010, list2011))
listall.sort()
sql = 'CREATE TABLE IF NOT EXISTS cmaq_exposures_data (\nid SERIAL UNIQUE PRIMARY KEY,\n' \
    'col INT,\nrow INT,\nutc_date_time TIMESTAMP'
for item in listall:
    sql += ',\n' + str(item) + ' FLOAT'
sql += '\n);'

print(sql)

Additional columns for statistical data can either be added at the same time, or after the fact. Want to get the proper use/definition of each variable from @arunacs first as the data quality ones may not require a full spectrum of aggregate pre-calculation.

arunacs commented 6 years ago

Based on a brainstorming session with the Exposure API team, IE will create Bias and Error statistics for select pollutants, computed as the average over all sites in the entire domain that have observations, at hourly resolution for each of 2010 and 2011, to be added to the API.
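To illustrate the kind of aggregation described here, the sketch below averages paired observed and modeled values over all sites for each hour. It is not the IE procedure: the sign convention (modeled minus observed), the use of RMSE as the error measure, the function name `hourly_bias_error`, and the input values are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def hourly_bias_error(pairs):
    """Aggregate (hour, observed, modeled) site pairs into per-hour stats."""
    diffs_by_hour = defaultdict(list)
    for hour, observed, modeled in pairs:
        diffs_by_hour[hour].append(modeled - observed)  # assumed sign convention
    stats = {}
    for hour, diffs in sorted(diffs_by_hour.items()):
        n = len(diffs)
        stats[hour] = {
            "n_sites": n,
            "bias": sum(diffs) / n,
            "rmse": sqrt(sum(d * d for d in diffs) / n),
        }
    return stats


# Illustrative values only: two sites in the first hour, one in the second.
stats = hourly_bias_error([
    ("2010-01-01 00:00:00", 10.0, 13.0),
    ("2010-01-01 00:00:00", 20.0, 16.0),
    ("2010-01-01 01:00:00", 12.0, 12.0),
])
print(stats)
```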

Issue: cmaq-exposure-api/issues/5

mjstealey commented 6 years ago

@arunacs, @lstillwe - Updated exposure list based on ingest of cmaq data files is here

We want to update the common_name column with whatever makes sense from a domain terminology perspective. If more than one word or phrase fits, then separate them by ;

arunacs commented 6 years ago

Domain-wide hourly Bias and error for O3 from the 2010 and 2011 simulations are available at:

/proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2010/Evaluation/CMAQ_2010_36k_base_O3_1_timeseries.csv

/proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2011/Evaluation/CMAQ_2011_12k_O3_1_timeseries.csv

Here is a sample header from 2010, and units are in ppbV.

"Date","CMAQ_2010_36k_base_Obs_Average","CMAQ_2010_36k_base_Model_Average","CMAQ_2010_36k_base_Bias_Average","CMAQ_2010_36k_base_RMSE_Average","CMAQ_2010_36k_base_Corr_Average"
2010-01-01 00:00:00,15.7729,27.5666,11.7936,18.6757,0.313

mjstealey commented 6 years ago

@arunacs, @lstillwe - Proposal for integration can be found at RENCI/cmaq-exposure-api/issues/5
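As a starting point for the ingest side, here is a minimal sketch that parses the O3 timeseries layout from the 2010 sample posted earlier in this thread. It assumes the last header field is `Corr_Average` (the space in the pasted sample looks like a wrapping artifact) and that every column except Date carries the run prefix; the function name `read_timeseries` is made up for this example.

```python
import csv
import io
from datetime import datetime

PREFIX = "CMAQ_2010_36k_base_"
FIELDS = ["Obs_Average", "Model_Average", "Bias_Average", "RMSE_Average", "Corr_Average"]

# Rebuild the sample header and first data row from the thread.
HEADER = ",".join(['"Date"'] + [f'"{PREFIX}{name}"' for name in FIELDS])
SAMPLE = HEADER + "\n2010-01-01 00:00:00,15.7729,27.5666,11.7936,18.6757,0.313\n"


def read_timeseries(csv_text, prefix):
    """Parse rows into dicts with the run prefix stripped from column names."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"Date": datetime.strptime(row["Date"], "%Y-%m-%d %H:%M:%S")}
        for key, value in row.items():
            if key.startswith(prefix):
                record[key[len(prefix):]] = float(value)
        records.append(record)
    return records


rows = read_timeseries(SAMPLE, PREFIX)
print(rows[0])
```

Stripping the prefix gives the same column names for the 2010 and 2011 files, which should make a shared table layout easier, assuming the 2011 file follows the same pattern with its own prefix.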

arunacs commented 6 years ago

Additional data quality (model performance evaluation) metrics will include the following:

VARNAME | COMMON NAME
Num_Obs | Number of Paired Observations
Obs_mean | Mean Observed Value
Mod_mean | Mean Modeled Value
Obs_median | Median of Observed Values
Mod_median | Median of Modeled Values
Coverage | Coverage based upon completeness criteria
MB | Mean Bias
ME | Mean Error
NMB | Normalized Mean Bias
NME | Normalized Mean Error
NMdnB | Normalized Median Bias
NMdnE | Normalized Median Error
FB | Fractional Bias
FE | Fractional Error
COR | Pearson Correlation Coefficient
R_Squared | R-Squared
Stand_Dev_obs | Standard Deviation of Observed Values
Stand_Dev_mod | Standard Deviation of Modeled Values
Coeff_of_Var_obs | Coefficient of Variation for Observed Values
Coeff_of_Var_mod | Coefficient of Variation for Modeled Values
Index_of_Agree | Index of Agreement
RMSE | Root Mean Squared Error
RMSE_systematic | Systematic Root Mean Squared Error
RMSE_unsystematic | Unsystematic Root Mean Squared Error
Skew_Obs | Skewness of Observed Values
Skew_Mod | Skewness of Modeled Values
Median_Diff | Median of differences
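For readers less familiar with these metrics, the sketch below computes a handful of them from paired observed and modeled values using definitions commonly seen in air quality model evaluation. The exact formulas behind the delivered files are not stated in this thread, so treat the percent scaling, the modeled-minus-observed sign convention, and the function name `performance_metrics` as assumptions.

```python
from math import sqrt


def performance_metrics(obs, mod):
    """Compute a few bias/error metrics from paired observed/modeled values."""
    if len(obs) != len(mod) or not obs:
        raise ValueError("obs and mod must be non-empty and the same length")
    n = len(obs)
    diffs = [m - o for o, m in zip(obs, mod)]
    abs_diffs = [abs(d) for d in diffs]
    return {
        "Num_Obs": n,
        "MB": sum(diffs) / n,                       # mean bias
        "ME": sum(abs_diffs) / n,                   # mean error
        "NMB": 100.0 * sum(diffs) / sum(obs),       # normalized mean bias, %
        "NME": 100.0 * sum(abs_diffs) / sum(obs),   # normalized mean error, %
        "FB": 100.0 * 2.0 / n * sum(d / (o + m) for d, o, m in zip(diffs, obs, mod)),
        "FE": 100.0 * 2.0 / n * sum(a / (o + m) for a, o, m in zip(abs_diffs, obs, mod)),
        "RMSE": sqrt(sum(d * d for d in diffs) / n),
    }


# Illustrative values only.
metrics = performance_metrics([10.0, 20.0], [12.0, 18.0])
print(metrics)
```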

arunacs commented 6 years ago

Additional bias/error metrics have been developed for multiple pollutants and are going through final QA before submission to DT.

lstillwe commented 6 years ago

Sarav, When will the additional bias/error metrics be available? I would like to know when I can schedule this work. Thanks, Lisa

arunacs commented 6 years ago

Hi Lisa,

I estimate them to be available by the end of the week.

Sarav


arunacs commented 6 years ago

DQ metrics in similar format to previously provided sample for O3 are now available for both years for several gas-phase and aerosol pollutants.

/proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2010/Evaluation/*.csv

and

/proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2011/Evaluation/*.csv

arunacs commented 6 years ago

The pollutants include:

Gas-phase: CO, O3, NO, NO2, NOx, NOy, SO2
Aerosols: PMIJ, ANH4IJ, ASO4IJ, AECIJ, AOCIJ and PM10

Note that some metrics are at daily resolution while others are at hourly resolution, depending on the measurement frequency of the observation network.
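One way a consumer could handle the mixed resolutions is to truncate the requested timestamp to the resolution of the metric before looking it up. The sketch below is only an illustration of that idea: which pollutants are daily is not stated here, the function name `lookup_metric` is made up, and the daily value is a placeholder rather than real data.

```python
from datetime import datetime


def lookup_metric(metrics, resolution, when):
    """Return the metric value covering `when`, or None if there is none."""
    if resolution == "hourly":
        key = when.replace(minute=0, second=0, microsecond=0)
    elif resolution == "daily":
        key = when.replace(hour=0, minute=0, second=0, microsecond=0)
    else:
        raise ValueError(f"unknown resolution: {resolution}")
    return metrics.get(key)


# Hourly O3 bias taken from the 2010 sample row earlier in this thread.
HOURLY_O3_BIAS = {datetime(2010, 1, 1, 0): 11.7936}
# Placeholder daily value, purely illustrative.
DAILY_EXAMPLE_BIAS = {datetime(2010, 1, 1): 1.0}

print(lookup_metric(HOURLY_O3_BIAS, "hourly", datetime(2010, 1, 1, 0, 42)))
print(lookup_metric(DAILY_EXAMPLE_BIAS, "daily", datetime(2010, 1, 1, 17, 5)))
```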