ResearchSoftwareInstitute / greendatatranslator

Green Team Data Translator Software Engineering and Development

Produce Exposure Data from CMAQ #20

Open rayi113 opened 7 years ago

KCB13 commented 7 years ago

Updated to 50% complete to match GreenTeamMilestonesTasksCommunications_0to6Months_03.02.17. @karafecho recommends further breaking down this issue into sub-issues.

arunacs commented 7 years ago

Tool is ready with synthetic location data and can read either street addresses or lat/lon coordinates for patient locations. It can easily be adapted to use either input option.

Primary vs Secondary PM2.5 separation is ongoing, using 2010 data at 36-km resolution. Using 2011 data may be a challenge, especially for the separation.

ahannaIE commented 7 years ago

Conducted further tests of the extraction tool.

arunacs commented 7 years ago

Made additional progress in distinguishing Primary vs Secondary PM2.5. Will explore implementing this in the 2010 dataset.

Obtained 2011 CMAQ data at 12-km resolution. Will adapt extraction tool to work with both 12-km and 36-km resolution model outputs.

arunacs commented 7 years ago

Extraction tool is now ready to use either 12-km (2011) or 36-km (2010) resolution CMAQ outputs.

arunacs commented 7 years ago

Completed developing the approach for separating Primary vs. Secondary PM2.5 for on-road traffic sources. It has been implemented in the processing tool, and we are evaluating the results.

ahannaIE commented 7 years ago

Developing a scenario case to demonstrate the use of the extraction tool.

arunacs commented 7 years ago

Built netCDF and the I/O API natively on Longleaf for the transition going forward. Waiting on ITS to provide virtual login node access.

KCB13 commented 7 years ago

Refer to issues #67 and #68

arunacs commented 7 years ago

The README file, an illustrative map, and hourly CMAQ data for the year 2010 for 63,266 patient records in Durham, Orange, and Wake counties are being copied to the Network Secure drive. All files should be there in a couple of hours.

arunacs commented 7 years ago

@empfff The files are also on Longleaf at: /proj/ie/proj/NIH-DataTranslator/for_RENCI

The transfer to the Network Secure drive is sllloow

lstillwe commented 7 years ago

Will someone let @mjstealey or me know when this data has been "HuSHed" by @jameschump? We need to load it before the hackathon. Thanks!

mjstealey commented 7 years ago

> Will someone let @mjstealey or me know when this data has been "HuSHed" by @jameschump? We need to load it before the hackathon. Thanks!

@arunacs, @empfff - We'll need the de-identified version with dates / geocodes that correspond to the HuSHed patient data (assuming this will be the case).

We will need to transfer this to a VM named bdtgis.renci.org (it actually just needs to get to a NetApp NFS mount named /projects/datatrans/NEW_FILE_HERE) so that it can be uploaded into the exposures database for use by the Exposures API. Depending on the size of the file, we may need to be creative as to how this happens. @rayi113 may know of a slick way to move the data from what I'm assuming is a secured environment to ours.

rayi113 commented 7 years ago

I can explore ways of moving the data more efficiently; however, as of 6:50 pm Saturday, I don't see the HuSHed+ version on the secure network drive yet, just the native output from the CMAQ model, which is still covered by the IRB.

arunacs commented 7 years ago

New 12-km resolution CMAQ data for 2011 for the entire continental U.S. is available at: /proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2011/ASCII_Extractions

For each 12x12-km grid cell, there is a timeseries of hourly data for the entire year. There are 459x299 directories, each containing 12 month-long files.

For example, for Column 459, Row 299: /proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2011/ASCII_Extractions/C459R299

[sarav@longleaf-login3 C459R299]$ ls -c1
CMAQ_2011_extractions_12k.C459R299.2011-12.csv
CMAQ_2011_extractions_12k.C459R299.2011-11.csv
CMAQ_2011_extractions_12k.C459R299.2011-10.csv
CMAQ_2011_extractions_12k.C459R299.2011-09.csv
CMAQ_2011_extractions_12k.C459R299.2011-08.csv
CMAQ_2011_extractions_12k.C459R299.2011-07.csv
CMAQ_2011_extractions_12k.C459R299.2011-06.csv
CMAQ_2011_extractions_12k.C459R299.2011-05.csv
CMAQ_2011_extractions_12k.C459R299.2011-04.csv
CMAQ_2011_extractions_12k.C459R299.2011-03.csv
CMAQ_2011_extractions_12k.C459R299.2011-02.csv
CMAQ_2011_extractions_12k.C459R299.2011-01.csv

[sarav@longleaf-login3 C459R299]$ head CMAQ_2011_extractions_12k.C459R299.2011-01.csv
ID,Lat,Lon,Col,Row,Date,O3_ppb,PM25_Total_ugm3
C459R299,50.35555,-54.56863,459,299,2011-01-01 01:00:00,32.45348,0.86265
C459R299,50.35555,-54.56863,459,299,2011-01-01 02:00:00,32.50745,0.84742
C459R299,50.35555,-54.56863,459,299,2011-01-01 03:00:00,32.43506,0.85534
C459R299,50.35555,-54.56863,459,299,2011-01-01 04:00:00,32.23847,0.89716
C459R299,50.35555,-54.56863,459,299,2011-01-01 05:00:00,32.04248,0.89332
C459R299,50.35555,-54.56863,459,299,2011-01-01 06:00:00,31.95899,0.84355
C459R299,50.35555,-54.56863,459,299,2011-01-01 07:00:00,31.96044,0.72426
C459R299,50.35555,-54.56863,459,299,2011-01-01 08:00:00,31.8492,0.62909
C459R299,50.35555,-54.56863,459,299,2011-01-01 09:00:00,31.60622,0.5826
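As a rough sketch of how downstream consumers could stitch one cell's 12 monthly files into a single hourly series (this assumes Python with pandas available; the path and filename pattern are as listed above):

import glob

import pandas as pd

# Hypothetical consumer of the ASCII extractions described above.
base = "/proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2011/ASCII_Extractions"
cell = "C459R299"

# Zero-padded month numbers mean a lexical sort is also chronological.
files = sorted(glob.glob(
    f"{base}/{cell}/CMAQ_2011_extractions_12k.{cell}.2011-*.csv"))
frames = [pd.read_csv(f, parse_dates=["Date"]) for f in files]
year = pd.concat(frames).sort_values("Date").reset_index(drop=True)

print(len(year))  # expect 8760 hourly records for 2011
print(year[["Date", "O3_ppb", "PM25_Total_ugm3"]].head())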

The lat/lon corresponds to the SW corner of each 12x12-km grid cell.
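To go the other way, from a patient lat/lon to the matching Col/Row directory, something like the sketch below should work. The Lambert conformal parameters and grid origin are assumptions based on the standard EPA 12-km CONUS (459x299) grid and should be verified against the GRIDDESC file for these runs; pyproj is assumed to be available.

import pyproj

# Assumed projection and grid parameters for the standard 12-km CONUS
# (459 x 299) CMAQ grid; confirm against the GRIDDESC for these runs.
lcc = pyproj.Proj(proj="lcc", lat_1=33.0, lat_2=45.0, lat_0=40.0,
                  lon_0=-97.0, a=6370000.0, b=6370000.0)
XORIG, YORIG, CELL = -2556000.0, -1728000.0, 12000.0  # meters

def latlon_to_colrow(lat, lon):
    """Map a lat/lon to 1-based column/row indices (e.g. C459R299)."""
    x, y = lcc(lon, lat)  # forward projection: degrees -> meters
    col = int((x - XORIG) // CELL) + 1
    row = int((y - YORIG) // CELL) + 1
    return col, row

print(latlon_to_colrow(35.9940, -78.8986))  # Durham, NC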

lstillwe commented 7 years ago

Sarav - Why do these files start at the 01:00:00 hour instead of the 00:00:00 hour? For instance, the file for cell C459R299 for January 2011, CMAQ_2011_extractions_12k.C459R299.2011-01.csv, starts with the date 2011-01-01 01:00:00 and ends with 2011-02-01 00:00:00, a February date. I am concerned about this because the exposure values I extracted from the NetCDF file seem to be one time step off from the values in these files. Also, I see that the PMIJ values I pulled out of the NetCDF do not match the PM25_Total_ugm3 values listed in these files. Thanks - Lisa

arunacs commented 7 years ago

Hi Lisa,

a) The original netCDF files do start at 01Z and end at 00Z, and the CSV extractions match that. For example, in the listing below, the January file starts at 01Z and has 744 time steps (31 days x 24 hrs). The timestep in this file refers to the end hour of the sub-hourly calculations when the data are written out. For example, the internal CMAQ calculations from 00Z to 01Z are stored at 01Z, and so on, until the last hour, where the internal calculations from 23Z to 00Z are stored at 00Z. I need to see why you are ending up with a one-time-step offset compared to these.

[sarav@longleaf-login3 2011]$ pwd
/proj/ie/proj/NIH-DataTranslator/for_RENCI/CMAQ/2011

[sarav@longleaf-login3 2011]$ ncdump -h CCTM_CMAQ_v51_Release_Oct23_NoDust_ed_emis_combine.aconc.01 | egrep 'TSTEP =|STIME ='
TSTEP = UNLIMITED ; // (744 currently)
:STIME = 10000 ;
:TSTEP = 10000 ;

b) The CSV files were generated using the PM25_TOT variable, and I suggest you use the same in your netCDF extractions. There is a subtle difference between PM25_TOT and PMIJ, and we use both depending on which monitoring instrument we compare the CMAQ predictions against. PM25_TOT corrects the modal distribution of CMAQ predictions to an absolute cutoff of 2.5 microns, whereas PMIJ includes the tails of the distributions and thus has slightly higher mass.
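A minimal sketch of extracting an hourly PM25_TOT series for one cell under the end-of-hour labeling convention described in (a). It assumes Python with the netCDF4 package, and the standard I/O API dimension order (TSTEP, LAY, ROW, COL) for CMAQ output:

import datetime as dt

import netCDF4  # assumed available

# Read the January ACONC file shown above and label each record with the
# end hour of the interval it represents (00Z-01Z is stored at 01Z, etc.).
nc = netCDF4.Dataset(
    "CCTM_CMAQ_v51_Release_Oct23_NoDust_ed_emis_combine.aconc.01")

stime = int(nc.getncattr("STIME"))    # HHMMSS; 10000 means 01:00:00
nsteps = len(nc.dimensions["TSTEP"])  # 744 for January (31 d x 24 h)
start = dt.datetime(2011, 1, 1, stime // 10000)
times = [start + dt.timedelta(hours=h) for h in range(nsteps)]

col, row = 459, 299  # 1-based grid indices
pm25 = nc.variables["PM25_TOT"][:, 0, row - 1, col - 1]  # (TSTEP, LAY, ROW, COL)

for t, v in zip(times[:3], pm25[:3]):
    print(t, float(v))  # should match the CSV rows above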

lstillwe commented 7 years ago

Sarav,

Thank you for the great explanation!

I notice that in the 2010 netCDF files, :STIME = 0. So do those files start at 00Z then?

arunacs commented 7 years ago

That is right. If the API can dynamically deal with reading and storing the timestamp, that would be great.
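One way to handle both years uniformly is to derive the first timestamp from each file's own header rather than hardcoding it. A sketch, assuming the standard I/O API global attributes SDATE (YYYYDDD) and STIME (HHMMSS) and the netCDF4 package:

import datetime as dt

import netCDF4  # assumed available

def start_timestamp(path):
    """Derive the first output timestamp from the I/O API header, so
    2010 files (:STIME = 0) and 2011 files (:STIME = 10000) are read
    the same way without hardcoding the start hour."""
    nc = netCDF4.Dataset(path)
    sdate = int(nc.getncattr("SDATE"))  # YYYYDDD (Julian day of year)
    stime = int(nc.getncattr("STIME"))  # HHMMSS
    year, doy = divmod(sdate, 1000)
    hh, mmss = divmod(stime, 10000)
    mm, ss = divmod(mmss, 100)
    return dt.datetime(year, 1, 1, hh, mm, ss) + dt.timedelta(days=doy - 1)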

karafecho commented 6 years ago

Also see #132